Introduction

Our principal goal

Our main goal is to evaluate different models and choose the best among them to determine whether new applicants represent a good or a bad credit risk.

In this context, we decided to use the methodology named “Cross-industry standard process for data mining” (CRISP-DM). This model consists of six phases that naturally describe the data science life cycle. The figure below illustrates this process.

Figure 1: CRISP-DM Process


The data

For this project, we will use the GermanCredit.csv dataset provided in the course Projects in Data Analytics for Decision Making, given by Professor Jacques Zuber. It contains 1’000 observations of credit applicants, described by 30 variables.

# read.csv2() expects semicolon-separated fields; the decimal mark is set to "."
data <- read.csv2(here::here("data/GermanCredit.csv"), dec = ".", header = TRUE)

Our questions

To guide the analysis, we ask ourselves some questions that we will try to answer through the EDA and the applied models. We will come back to these questions in the conclusions section:

  1. Are there any variables that could be grouped?
  2. Have we used all the original independent variables of the model?
  3. Is the data balanced with respect to the response variable?
  4. Does it make sense to balance the data to avoid the model being biased?
  5. Which metric should we focus on the most: accuracy, sensitivity or specificity?
  6. Which model fits the data the best?

Business understanding

The goal of this phase is to understand the project objectives and what is needed to achieve them, then translate them into a data mining problem definition, which includes a plan for the analysis and its application that we will follow step by step.

Determine business objectives

In general, the banking business has two main goals:

  • The first, to offer a variety of products or services that answer the needs of individual and business customers;
  • The second, and the most important one for this project, to collect payments for the products provided to clients, with the goal of generating income for shareholders.

Both objectives create an organic balance between the customers’ needs and the gains that the company requires for its operations. However, in this analysis, we will focus on diminishing the risk linked to the second goal. In other words, we are looking for a good model to help us forecast which clients have a higher risk of not being able to pay back a loan that has been granted to them.

More precisely, we will try to minimize the losses, defined as the sum of the credit amounts granted to applicants who are predicted to be positive (hence eligible for a credit) but who should actually have been predicted as negative. These false positives are a risk for the company, as the applicants may not be able to pay back the amount they received.

Our target is for these losses to be smaller than 10% of the total amount of credit that would be granted to the customers.

Goal: Losses < 10% amount of credit

We will measure this under the assumption that the company grants a credit only to applicants who have a good credit score, and refuses it otherwise.
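
As a minimal sketch of how this target could be checked, assume we have the actual responses, the predicted responses and the credit amounts. The toy values and the names `actual`, `predicted`, `amount` and `loss_ratio` are ours and purely illustrative. We also assume that the denominator is the total amount granted to the applicants predicted as positive, which is our reading of the goal above.

```r
# Toy data (hypothetical values): 1 = good credit risk, 0 = bad credit risk
actual    <- c(1, 0, 1, 1, 0, 1)
predicted <- c(1, 1, 1, 0, 0, 1)
amount    <- c(1000, 500, 2000, 1500, 3000, 500)

# Share of the granted amount that goes to applicants wrongly predicted as good
loss_ratio <- function(actual, predicted, amount) {
  granted        <- predicted == 1          # credits the company would grant
  false_positive <- granted & actual == 0   # granted, but actually a bad risk
  sum(amount[false_positive]) / sum(amount[granted])
}

ratio <- loss_ratio(actual, predicted, amount)
goal_met <- ratio < 0.10   # TRUE only when the 10% target is met
```

With these toy values the ratio is above the target, so such a model would not meet the goal.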

Figure 2: Goal of a credit


Assess situation

We now state our assumptions and list the requirements and constraints that the project could have.

Assess          List
Assumptions     1) The team members have all the required skills.
                2) The data is real.
Requirements    1) Boundaries of the work: identify the best model among at least 5.
                2) Submit a report with our findings.
Constraints     1) We have about 7 weeks to complete the analysis.
                2) The size of the data set.
                3) Limited input variables.

Determine data mining goals

The proposed data mining goal for this analysis is to obtain a model that reliably evaluates the risk of failure to repay, given the information of a new applicant, and hence helps to decide whether it is a good idea to grant them the credit they are requesting. This model needs to have a high accuracy, with particular attention to the negative impact of a false positive.

The specific goals for the different parts of our analysis are the following:

  1. Classification: Group variables that bring similar information. Potentially create dummy variables and eliminate the variables which do not bring enough information.
  2. Prediction: Find the model that gives us the best predictions when compared with the testing data.
  3. Optimization: Maximization of the sensitivity of the model.
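
To make these metrics concrete, here is a small sketch of how accuracy, sensitivity and specificity can be computed from predictions. The data are toy values and the function name `classification_metrics` is ours. We assume that 1 (credit accepted) is the positive class, so a false positive lowers the specificity.

```r
# Toy data (hypothetical values): 1 = credit accepted, 0 = credit rejected
actual    <- c(1, 1, 1, 1, 0, 0, 0, 1, 0, 1)
predicted <- c(1, 1, 0, 1, 0, 1, 0, 1, 0, 1)

classification_metrics <- function(actual, predicted) {
  tp <- sum(predicted == 1 & actual == 1)   # true positives
  tn <- sum(predicted == 0 & actual == 0)   # true negatives
  fp <- sum(predicted == 1 & actual == 0)   # false positives
  fn <- sum(predicted == 0 & actual == 1)   # false negatives
  c(accuracy    = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

metrics <- classification_metrics(actual, predicted)
metrics
```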

Produce project plan

In order to meet the objectives, we made a Gantt chart. The time span considered was divided into 6 weeks, from November 2 to December 18.

Figure 3: Gantt Project


We have followed the deadlines in order to obtain the corresponding feedback for each week.

Data Understanding

Going forward with the analysis, the goal of the second step is to get a first impression of the information brought by the data and to form hypotheses about it.

Collect initial data

The dataset was delivered together with the description of the task, in csv format. It contains 1’000 observations, one per row, with 30 input variables and 1 output variable. In addition to the dataset, we looked for more information in YouTube videos, mainly to familiarize ourselves with how credit selection works and, hence, to look deeper into the variables that we consider to bring the most information.

Describe data

At this point, we will examine the gross properties of the acquired data. Let’s start by checking its structure and size. As you can see below, there are 1’000 rows and 32 variables. The first column identifies each observation with a unique ID (a number), the 30 following columns are the input variables, and the last one is the output variable, which indicates whether the person is a high risk (the credit is rejected) or not (the credit is accepted).

dim(data)
## [1] 1000   32

Now, let’s have a look at the summary and the structure of the data, including statistical characteristics such as the minimum, mean and maximum.

Overview of the dataset (1000 observations):

Summary

Output variable: yes / no

As we can see from the graph, the majority of the observations have a positive value (700 against 300).

Description by output variable

## 
##  Descriptive statistics by group 
## group: 0
##                  vars   n    mean      sd median trimmed     mad min   max
## OBS.                1 300  515.76  281.03  542.0  518.66  349.15   2   999
## CHK_ACCT            2 300    0.90    1.05    1.0    0.75    1.48   0     3
## DURATION            3 300   24.86   13.28   24.0   23.59   17.79   6    72
## HISTORY             4 300    2.17    1.08    2.0    2.19    0.00   0     4
## NEW_CAR             5 300    0.30    0.46    0.0    0.25    0.00   0     1
## USED_CAR            6 300    0.06    0.23    0.0    0.00    0.00   0     1
## FURNITURE           7 300    0.19    0.40    0.0    0.12    0.00   0     1
## RADIO.TV            8 300    0.21    0.41    0.0    0.13    0.00   0     1
## EDUCATION           9 300    0.07    0.26    0.0    0.00    0.00   0     1
## RETRAINING         10 300    0.11    0.32    0.0    0.02    0.00   0     1
## AMOUNT             11 300 3938.13 3535.82 2574.5 3291.18 2092.69 433 18424
## SAV_ACCT           12 300    0.67    1.30    0.0    0.34    0.00   0     4
## EMPLOYMENT         13 300    2.17    1.22    2.0    2.18    1.48   0     4
## INSTALL_RATE       14 300    3.10    1.09    4.0    3.25    0.00   1     4
## MALE_DIV           15 300    0.07    0.25    0.0    0.00    0.00   0     1
## MALE_SINGLE        16 300    0.49    0.50    0.0    0.48    0.00   0     1
## MALE_MAR_or_WID    17 300    0.08    0.28    0.0    0.00    0.00   0     1
## CO.APPLICANT       18 300    0.06    0.24    0.0    0.00    0.00   0     1
## GUARANTOR          19 300    0.03    0.18    0.0    0.00    0.00   0     1
## PRESENT_RESIDENT   20 300    2.85    1.09    3.0    2.94    1.48   1     4
## REAL_ESTATE        21 300    0.20    0.40    0.0    0.12    0.00   0     1
## PROP_UNKN_NONE     22 300    0.22    0.42    0.0    0.15    0.00   0     1
## AGE                23 300   33.96   11.22   31.0   32.38    8.90  19    74
## OTHER_INSTALL      24 300    0.25    0.44    0.0    0.19    0.00   0     1
## RENT               25 300    0.23    0.42    0.0    0.17    0.00   0     1
## OWN_RES            26 300    0.62    0.49    1.0    0.65    0.00   0     1
## NUM_CREDITS        27 300    1.37    0.56    1.0    1.29    0.00   1     4
## JOB                28 300    1.94    0.67    2.0    1.95    0.00   0     3
## NUM_DEPENDENTS     29 300    1.15    0.36    1.0    1.07    0.00   1     2
## TELEPHONE          30 300    0.38    0.49    0.0    0.35    0.00   0     1
## FOREIGN            31 300    0.01    0.11    0.0    0.00    0.00   0     1
## RESPONSE           32 300    0.00    0.00    0.0    0.00    0.00   0     0
##                  range  skew kurtosis     se
## OBS.               997 -0.08    -1.13  16.23
## CHK_ACCT             3  0.99    -0.27   0.06
## DURATION            66  0.83     0.03   0.77
## HISTORY              4  0.07    -0.09   0.06
## NEW_CAR              1  0.89    -1.22   0.03
## USED_CAR             1  3.82    12.60   0.01
## FURNITURE            1  1.55     0.39   0.02
## RADIO.TV             1  1.44     0.08   0.02
## EDUCATION            1  3.26     8.64   0.02
## RETRAINING           1  2.43     3.91   0.02
## AMOUNT           17991  1.57     2.05 204.14
## SAV_ACCT             4  1.83     1.82   0.08
## EMPLOYMENT           4  0.12    -0.96   0.07
## INSTALL_RATE         3 -0.72    -0.97   0.06
## MALE_DIV             1  3.46     9.98   0.01
## MALE_SINGLE          1  0.05    -2.00   0.03
## MALE_MAR_or_WID      1  3.00     7.02   0.02
## CO.APPLICANT         1  3.69    11.63   0.01
## GUARANTOR            1  5.17    24.85   0.01
## PRESENT_RESIDENT     3 -0.25    -1.40   0.06
## REAL_ESTATE          1  1.49     0.23   0.02
## PROP_UNKN_NONE       1  1.32    -0.25   0.02
## AGE                 55  1.14     0.73   0.65
## OTHER_INSTALL        1  1.13    -0.73   0.03
## RENT                 1  1.25    -0.43   0.02
## OWN_RES              1 -0.49    -1.76   0.03
## NUM_CREDITS          3  1.45     2.34   0.03
## JOB                  3 -0.40     0.44   0.04
## NUM_DEPENDENTS       1  1.91     1.67   0.02
## TELEPHONE            1  0.51    -1.75   0.03
## FOREIGN              1  8.44    69.53   0.01
## RESPONSE             0   NaN      NaN   0.00
## ------------------------------------------------------------ 
## group: 1
##                  vars   n    mean      sd median trimmed     mad min   max
## OBS.                1 700  493.96  292.05  482.5  492.61  377.32   1  1000
## CHK_ACCT            2 700    1.87    1.23    2.0    1.96    1.48   0     3
## DURATION            3 700   19.21   11.08   18.0   17.88    8.90   4    60
## HISTORY             4 700    2.71    1.04    2.0    2.73    0.00   0     4
## NEW_CAR             5 700    0.21    0.41    0.0    0.13    0.00   0     1
## USED_CAR            6 700    0.12    0.33    0.0    0.03    0.00   0     1
## FURNITURE           7 700    0.18    0.38    0.0    0.09    0.00   0     1
## RADIO.TV            8 700    0.31    0.46    0.0    0.26    0.00   0     1
## EDUCATION           9 700    0.04    0.20    0.0    0.00    0.00  -1     1
## RETRAINING         10 700    0.09    0.29    0.0    0.00    0.00   0     1
## AMOUNT             11 700 2985.46 2401.47 2244.0 2564.20 1485.57 250 15857
## SAV_ACCT           12 700    1.29    1.65    0.0    1.11    0.00   0     4
## EMPLOYMENT         13 700    2.48    1.19    2.0    2.54    1.48   0     4
## INSTALL_RATE       14 700    2.92    1.13    3.0    3.02    1.48   1     4
## MALE_DIV           15 700    0.04    0.20    0.0    0.00    0.00   0     1
## MALE_SINGLE        16 700    0.57    0.49    1.0    0.59    0.00   0     1
## MALE_MAR_or_WID    17 700    0.10    0.29    0.0    0.00    0.00   0     1
## CO.APPLICANT       18 700    0.03    0.18    0.0    0.00    0.00   0     1
## GUARANTOR          19 700    0.06    0.25    0.0    0.00    0.00   0     2
## PRESENT_RESIDENT   20 700    2.84    1.11    3.0    2.93    1.48   1     4
## REAL_ESTATE        21 700    0.32    0.47    0.0    0.27    0.00   0     1
## PROP_UNKN_NONE     22 700    0.12    0.33    0.0    0.03    0.00   0     1
## AGE                23 700   36.30   11.77   34.0   34.92   10.38  19   125
## OTHER_INSTALL      24 700    0.16    0.36    0.0    0.07    0.00   0     1
## RENT               25 700    0.16    0.36    0.0    0.07    0.00   0     1
## OWN_RES            26 700    0.75    0.43    1.0    0.82    0.00   0     1
## NUM_CREDITS        27 700    1.42    0.58    1.0    1.35    0.00   1     4
## JOB                28 700    1.89    0.65    2.0    1.89    0.00   0     3
## NUM_DEPENDENTS     29 700    1.16    0.36    1.0    1.07    0.00   1     2
## TELEPHONE          30 700    0.42    0.49    0.0    0.39    0.00   0     1
## FOREIGN            31 700    0.05    0.21    0.0    0.00    0.00   0     1
## RESPONSE           32 700    1.00    0.00    1.0    1.00    0.00   1     1
##                  range  skew kurtosis    se
## OBS.               999  0.04    -1.23 11.04
## CHK_ACCT             3 -0.39    -1.53  0.05
## DURATION            56  1.18     1.38  0.42
## HISTORY              4  0.00    -0.90  0.04
## NEW_CAR              1  1.44     0.08  0.02
## USED_CAR             1  2.29     3.26  0.01
## FURNITURE            1  1.70     0.89  0.01
## RADIO.TV             1  0.81    -1.34  0.02
## EDUCATION            2  4.31    20.27  0.01
## RETRAINING           1  2.86     6.18  0.01
## AMOUNT           15607  1.94     4.62 90.77
## SAV_ACCT             4  0.76    -1.16  0.06
## EMPLOYMENT           4 -0.22    -0.87  0.04
## INSTALL_RATE         3 -0.45    -1.29  0.04
## MALE_DIV             1  4.50    18.32  0.01
## MALE_SINGLE          1 -0.30    -1.91  0.02
## MALE_MAR_or_WID      1  2.74     5.53  0.01
## CO.APPLICANT         1  5.23    25.39  0.01
## GUARANTOR            2  3.93    14.88  0.01
## PRESENT_RESIDENT     3 -0.28    -1.38  0.04
## REAL_ESTATE          1  0.78    -1.39  0.02
## PROP_UNKN_NONE       1  2.27     3.17  0.01
## AGE                106  1.43     4.51  0.45
## OTHER_INSTALL        1  1.88     1.54  0.01
## RENT                 1  1.89     1.59  0.01
## OWN_RES              1 -1.17    -0.63  0.02
## NUM_CREDITS          3  1.20     1.30  0.02
## JOB                  3 -0.37     0.50  0.02
## NUM_DEPENDENTS       1  1.89     1.59  0.01
## TELEPHONE            1  0.34    -1.89  0.02
## FOREIGN              1  4.26    16.21  0.01
## RESPONSE             0   NaN      NaN  0.00

As we see in the table shown above, all the values are integers and there are no missing values. In addition, we can see some inconsistencies between the variables and the initial description, which we explain in the following table.

Variable          Description         Inconsistencies
CHK_ACCT          C: 0, 1, 2, 3       X
DURATION          Numerical           X
HISTORY           C: 0, 1, 2, 3, 4    X
NEW_CAR           B: 0, 1             X
USED_CAR          B: 0, 1             X
FURNITURE         B: 0, 1             X
RADIO.TV          B: 0, 1             X
EDUCATION         B: 0, 1             ✔: According to the description we should have a binary variable, but the data show a -1
RETRAINING        B: 0, 1             X
AMOUNT            Numerical           X
SAV_ACCT          C: 0, 1, 2, 3, 4    X
EMPLOYMENT        C: 0, 1, 2, 3, 4    X
INSTALL_RATE      Numerical           X
MALE_DIV          B: 0, 1             X
MALE_SINGLE       B: 0, 1             X
MALE_MAR_or_WID   B: 0, 1             X
CO.APPLICANT      B: 0, 1             X
GUARANTOR         B: 0, 1             ✔: According to the description we should have a binary variable, but the data show a 2
PRESENT_RESIDENT  C: 0, 1, 2, 3       ✔: According to the description the categories should be coded from 0 to 3, but the data show values from 1 to 4
REAL_ESTATE       B: 0, 1             X
PROP_UNKN_NONE    B: 0, 1             X
AGE               Numerical           ✔: Identify outliers, the age should not go up to 125 years
OTHER_INSTALL     B: 0, 1             X
RENT              B: 0, 1             X
OWN_RES           B: 0, 1             X
NUM_CREDITS       Numerical           X
JOB               C: 0, 1, 2, 3       X
NUM_DEPENDENTS    Numerical           X
TELEPHONE         B: 0, 1             X
FOREIGN           B: 0, 1             X

Then, for the 4 inconsistencies we have found, we have established the following hypotheses and solutions.

  • EDUCATION: there was an error in the registration of the information and the -1 must be replaced by a 1.
  • GUARANTOR: there was an error in the registration of the information and the 2 must be replaced by a 1.
  • PRESENT_RESIDENT: there was an error in the registration of the information and the value was considered in years instead of categories.
  • AGE: the value 125 is clearly an outlier; we should limit the age to 75.

The corrections will be made in the next section.
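
Such inconsistencies can also be detected programmatically. The sketch below counts, for each variable, the values that fall outside an allowed set. The toy data, the `allowed` list and the age range are hypothetical and only mimic the cases found above.

```r
# Toy data (hypothetical values) reproducing the kind of inconsistencies found
toy <- data.frame(
  EDUCATION = c(0, 1, -1, 0),
  GUARANTOR = c(0, 2, 1, 0),
  AGE       = c(23, 125, 41, 67)
)

# Allowed values per variable (binary codes as in the description;
# the AGE range is a hypothetical plausibility bound, not from the description)
allowed <- list(
  EDUCATION = 0:1,
  GUARANTOR = 0:1,
  AGE       = 18:100
)

# Count the values of each variable that fall outside the allowed set
out_of_range <- sapply(names(allowed),
                       function(v) sum(!toy[[v]] %in% allowed[[v]]))
out_of_range
```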

It is also important to mention that the output variable shows that in 70% of the cases the credit is accepted and in 30% rejected, which can later bias the prediction.

Explore data

In this section, we will go deeper into the data and look for patterns or relationships between variables. To do so, we will draw a histogram for each variable to check its distribution.

Histogram

All variables

Of independent variables grouped by response


Regarding the last charts, we have the following observations:

  • We can see that the numerical variables have right-skewed (non-normal) distributions, possibly because they are bounded from below.
  • It is very difficult to identify patterns in the data.
  • In the histograms by response variable, we are able to see the proportion of positive and negative responses among the applications.

Boxplot

Now that we have checked the distribution of the variables, let’s move on to the evaluation of their quartiles.

All variables

Of independent variables grouped by response

In the following table, you will find our principal observations.

Boxplot             Observation
For each variable   We identified that some variables could be mutually exclusive. We can evaluate the formation of the following groups:
                    1) Aggregation of the variables describing the purpose of the credit.
                    2) Aggregation of the male variables as a categorical one.
                    3) Aggregation of REAL_ESTATE and PROP_UNKN_NONE as a categorical one.
                    4) Aggregation of RENT and OWN_RES as a categorical one.
Plot by response    The variables which stand out the most are CHK_ACCT and EMPLOYMENT. We can observe that the boxes differ between the response groups.

Additionally, we could identify some outliers, shown as red dots. As you can see in the second chart, the input variables grouped by the output variable show that some features bring more information than others.

Regarding the variables that we can create, we want to give the following definitions:

Key definitions:

  1. Real property includes the physical property of the real estate (physical land, structures and resources attached to it), but its definition extends to other types of ownership, such as rights. This means that we can have properties without real estate (PROP_UNKN_NONE = 0 & REAL_ESTATE = 0).
  2. For the RENT and OWN_RES variables, we have 3 cases: 1-0, 0-1 and 0-0, because it is possible that the person who applies for the loan neither owns the residence nor pays any rent.

Correlation

Now, we will have a look at the correlation between variables.

The most correlated variables are the following:

  • History and number of credits are positively correlated.
  • Duration and amount are positively correlated.
  • The response variable and the checking account are positively correlated.

In the model section we will evaluate the coefficients of the variables and continue this analysis in greater depth.
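
One way to obtain such a ranking of correlated pairs is sketched below in base R. The data frame `toy` contains hypothetical values; in the report the same steps would apply to the numeric columns of `data`. The name `cor_pairs` is ours.

```r
# Toy data (hypothetical values)
toy <- data.frame(
  DURATION = c(6, 12, 24, 36, 48, 60),
  AMOUNT   = c(500, 1200, 2500, 4000, 5200, 7000),
  AGE      = c(45, 23, 37, 52, 29, 33)
)

cor_mat <- cor(toy)                               # Pearson correlation matrix
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA    # keep each pair only once

# Long table of variable pairs, sorted by absolute correlation
cor_pairs <- as.data.frame(as.table(cor_mat))
names(cor_pairs) <- c("var1", "var2", "correlation")
cor_pairs <- cor_pairs[!is.na(cor_pairs$correlation), ]
cor_pairs <- cor_pairs[order(-abs(cor_pairs$correlation)), ]
cor_pairs
```

Sorting by the absolute value keeps strong negative correlations at the top as well, which a plain descending sort would hide.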

Verify data quality

To verify the quality of the data, we establish 3 questions that we will address during this step:

  1. Is the data complete (does it cover all the cases required)?
  2. Is it correct or does it contain any error?
  3. Are there missing values in the data? If so how are they represented?

Overall dataset

By variable

## [1] 0
## [1] 1

## # A tibble: 1 x 4
##   type      cnt  pcnt col_name    
##   <chr>   <int> <dbl> <named list>
## 1 integer    31   100 <chr [31]>

Below you will find the answers to the 3 questions:

Question   Answer
1          Yes, all the columns and rows contain information.
2          No, the data contain some errors. They were found during the data description and exploration and are the following:
           1) The variable EDUCATION shows a value that is not binary.
           2) The variable PRESENT_RESIDENT takes a value outside the categories mentioned in the description.
           3) The variable AGE is out of range.
           4) AMOUNT, DURATION and AGE are the variables with the highest number of outliers.
3          No, there are no missing values in the data.
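
For illustration, checks of this kind can be written in a few lines of base R. This is only a sketch on toy data with one missing value added on purpose; the variable names on the left-hand side are ours.

```r
# Toy data (hypothetical values), with one missing value added on purpose
toy <- data.frame(
  AGE    = c(23L, 41L, NA),
  AMOUNT = c(500L, 1200L, 2500L)
)

n_missing      <- sum(is.na(toy))            # total number of missing values
missing_by_var <- colSums(is.na(toy))        # missing values per variable
share_complete <- mean(complete.cases(toy))  # share of rows without any NA
column_types   <- sapply(toy, class)         # storage type of each column
```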

In addition, we are going to analyse whether the aggregations mentioned in the boxplot observations are possible. To do so, we are going to apply the chi-squared test to measure the independence between the variables concerned. Next, we explain the steps we will follow for each analysis:

1) Variables: REAL_ESTATE and PROP_UNKN_NONE

First, we establish the hypotheses:

\(H_0\): The REAL_ESTATE and PROP_UNKN_NONE are independent variables.

against the bilateral alternative:

\(H_1\): They are not independent.

For the chi-squared test to be valid, the following conditions must be true:

  1. The sampling method is random
  2. The variables considered are categorical
  3. Size: all cells of the contingency table have an expected count of at least 5.

Assumption: a significance level of 0.05.

Clarification: the p-value is the probability that a chi-squared statistic with the given degrees of freedom is more extreme than the observed \(X^2\).

Finally, we reject or fail to reject the null hypothesis by checking the p-value. If the p-value is less than the chosen significance level, we reject the null hypothesis and conclude that there is a relationship between the variables.
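
The procedure can be sketched on toy data as follows. The two binary vectors `x` and `y` are hypothetical and built so that they are never both equal to 1; the name `chi_test` is ours.

```r
# Toy data (hypothetical values): two binary variables that are never both 1
x <- rep(c(1, 0, 0), times = c(30, 20, 50))
y <- rep(c(0, 1, 0), times = c(30, 20, 50))

# For 2x2 tables, chisq.test() applies Yates' continuity correction by default
chi_test <- chisq.test(x, y)

size_ok <- all(chi_test$expected >= 5)   # size condition on the expected counts
reject  <- chi_test$p.value < 0.05       # TRUE: reject independence at the 5% level
```

Checking `chi_test$expected` before reading the p-value makes the size condition explicit instead of leaving it as an unverified assumption.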

2) RENT and OWN_RES

First, we establish the hypotheses:

\(H_0\): The RENT and OWN_RES are independent variables.

against the bilateral alternative:

\(H_1\): They are not independent.

Then, we move on with the same process that we mentioned above.

The analysis can be found in the next chapter.

Exploratory data analysis

Clean data

Task: Raise the data quality to the level required by the selected analysis techniques. This may involve selection of clean subsets of the data, the insertion of suitable defaults or more ambitious techniques such as the estimation of missing data by modeling.

Output: Describe what decisions and actions were taken to address the data quality problems reported during the verify data quality task of the data understanding phase. Transformations of the data for cleaning purposes and the possible impact on the analysis results should be considered.

Reconsider how to deal with observed type of noise

We will consider how to correct the inconsistencies we have found in the previous chapter, which are on four different variables, namely:

  • EDUCATION: there was an error in the registration of the information and the -1 must be replaced by a 1.
  • GUARANTOR: there was an error in the registration of the information and the 2 must be replaced by a 1.
  • PRESENT_RESIDENT: there was an error in the registration of the information and the value was considered in years instead of categories.
  • AGE: the value 125 is clearly an outlier; we should limit the age to 75.

We have already decided how to correct them, so we will now proceed accordingly.

Correct, remove or ignore noise

We will start by correcting the noise in EDUCATION, GUARANTOR and PRESENT_RESIDENT. For the first two, we simply replace the values -1 and 2 with the value 1. For the latter, we shift the category codes by subtracting 1 from each value, which gives the values corresponding to the data description that was given to us.

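library(dplyr)     # mutate(), filter()
library(magrittr)  # %<>% assignment pipe

# EDUCATION: the -1 is treated as a recording error and replaced by 1
data %<>% 
  mutate(EDUCATION = replace(EDUCATION, EDUCATION == -1, 1))

# GUARANTOR: the 2 is treated as a recording error and replaced by 1
data %<>% 
  mutate(GUARANTOR = replace(GUARANTOR, GUARANTOR == 2, 1))

# PRESENT_RESIDENT: shift the codes from 1-4 to 0-3
data %<>% 
  mutate(PRESENT_RESIDENT = PRESENT_RESIDENT - 1)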

Decide how to deal with special values and their meaning

This is specifically for the case of AGE. As previously said, we believe that the age of 125 is an error, hence we will discard it by keeping only the observations with a value lower than 76 (as 75 is the second highest value).

#AGE
data %<>% 
  filter(AGE < 76)

Construct data

Task: This task includes constructive data preparation operations such as the production of derived attributes, entire new records or transformed values for existing attributes.

Output: Derived attributes are new attributes that are constructed from one or more existing attributes in the same record. Example: area = length * width. Describe the creation of completely new records. Example: create records for customers who made no purchase during the past year. There was no reason to have such records in the raw data, but for modeling purposes it might make sense to explicitly represent the fact that certain customers made zero purchases.

Check available construction mechanisms

As we already mentioned in the previous chapter, we will be able to create 4 different variables:

  1. A binary variable describing the sex of the person (male vs. female)
  2. A categorical variable for the purpose of the credit
  3. A categorical variable describing the real estate situation of the person (i.e. if someone owns a residence)
  4. A categorical variable describing the property situation of the person (i.e. if they own their residence, are renting or something else)

Sex variable

We will start with the variable describing the sex of the person considered. This variable is built from the MALE_DIV, MALE_SINGLE and MALE_MAR_or_WID variables, and it is a binary variable taking value 1 if the person is male, and value 0 if they are female.

More specifically, if any one of the variables used to construct the new one has value 1, so will the SEX_MALE variable; otherwise it will have value 0.

data %<>% 
  mutate(SEX_MALE = ifelse((MALE_DIV | MALE_SINGLE | MALE_MAR_or_WID) == 1, 1, 0)) %>% 
  mutate(SEX_MALE = as.factor(SEX_MALE))

We will now explore the new variable a little, by looking at the number of instances in each category and at how it relates to the response variable.

# Representation of SEX_MALE per value
data %>% 
  ggplot(aes(SEX_MALE)) + 
  geom_bar(aes(fill = factor(SEX_MALE))) + 
  theme(legend.position = "none")  + 
  geom_label(stat = 'count', aes(label = after_stat(count))) 

#Representation of output variable in terms of SEX_MALE
data %>% 
  ggplot(aes(RESPONSE)) + 
  geom_bar(aes(fill = factor(SEX_MALE)), position = "dodge")+ 
  labs(color = "", fill = "SEX_MALE", x = "RESPONSE", y = "count") 

We can see from the first graph that we have more observations with a positive value for the SEX_MALE variable (690 vs. 309), meaning that there are more men than women in the dataset.

Moreover, the second graph shows more accepted credits for men than for women, but this could also be due to the fact that men are more numerous than women in the dataset.

Combine variables

We will now move on to the other variables aforementioned, so that instead of having multiple dummy variables, we have factor variables with multiple levels.

Purpose

Let’s start with the purpose of credit.

This variable will take the following values:

  • 1 = the purpose of the credit was a new car
  • 2 = the purpose of the credit was a used car
  • 3 = the purpose of the credit was furniture
  • 4 = the purpose of the credit was a radio or a television
  • 5 = the purpose of the credit was to increase the level of education
  • 6 = the purpose of the credit was a retraining
  • 0 = the purpose of the credit was something else

It is created by assigning to each observation the level of the purpose dummy that is equal to 1; if none of the dummies is equal to 1, PURPOSE takes the value 0.

data %<>% 
  mutate(PURPOSE = ifelse(NEW_CAR == 1, 1, 
                          ifelse(USED_CAR == 1, 2, 
                                 ifelse(FURNITURE == 1, 3, 
                                        ifelse(RADIO.TV == 1, 4, 
                                               ifelse(EDUCATION == 1, 5, 
                                                      ifelse(RETRAINING == 1, 6, 0))))))) %>% 
  mutate(PURPOSE = as.factor(PURPOSE))
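
Because the nested `ifelse()` keeps only the first dummy that equals 1, this construction relies on the purpose dummies being mutually exclusive. A quick way to check this assumption is sketched below on toy data (hypothetical values, with only three of the purpose dummies); the name `n_active` is ours.

```r
# Toy data (hypothetical values) with three of the purpose dummies
toy <- data.frame(
  NEW_CAR   = c(1, 0, 0, 0),
  USED_CAR  = c(0, 1, 0, 0),
  FURNITURE = c(0, 0, 0, 1)
)

# Number of dummies equal to 1 in each row; more than 1 would mean an overlap
n_active <- rowSums(toy == 1)
table(n_active)

# TRUE when no observation has more than one purpose
exclusive <- all(n_active <= 1)
```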

Let’s have a look at the new variable, in terms of number of observations per level and its link to the response variable.

data %>% 
  ggplot(aes(PURPOSE)) + 
  geom_bar(aes(reorder(PURPOSE, -table(PURPOSE)[PURPOSE]), fill = PURPOSE)) +
  scale_fill_discrete(name = "PURPOSE", 
                      labels = c("OTHER", "NEW_CAR", "USED_CAR", "FURNITURE", 
                                 "RADIO/TV", "EDUCATION", "RETRAINING")) + 
  geom_label(stat = 'count', aes(label = after_stat(count))) 

data %>% 
  ggplot(aes(RESPONSE)) + 
  geom_bar(aes(fill = factor(PURPOSE)), position = "dodge") + 
  labs(x = "RESPONSE", y = "count") +  
  scale_fill_discrete(name = "PURPOSE", 
                      labels = c("OTHER", "NEW_CAR", "USED_CAR", "FURNITURE", 
                                 "RADIO/TV", "EDUCATION", "RETRAINING")) + 
  theme_bw()

In the first graph, we can see that most observations have the purpose of getting a radio or a TV, followed by a new car and then furniture, while the least frequent purpose is education.

In terms of the output variable, shown in the second graph, the largest differences can be found for Radio/TV and new car, but this could be due to the fact that they are the purposes with the highest number of observations.

Property

Now, we will create the property variable.

We will start by checking whether the two variables that we want to use (namely, REAL_ESTATE and PROP_UNKN_NONE) are associated, in which case it makes sense to put them together.

In order to do so, as we previously mentioned, we will perform a chi-squared independence test.

chisq.test(data$REAL_ESTATE, data$PROP_UNKN_NONE)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  data$REAL_ESTATE and data$PROP_UNKN_NONE
## X-squared = 69.97, df = 1, p-value < 2.2e-16

We can see that the two variables are significantly associated, as the p-value is very low, almost equal to 0, and hence lower than the chosen significance level of alpha = 5%.

We can conclude that it makes sense to merge the two variables into one factor, which will take value 1 if the person owns real estate, value 2 if the person has no known property and value 0 otherwise.

data %<>% 
  mutate(PROPERTY = as.factor(ifelse(REAL_ESTATE == 1, 1, 
                                     ifelse(PROP_UNKN_NONE == 1, 2, 0))))

Let’s also have a look at this new variable, once again in terms of the number of observations per level and of whether the occurrences differ with the output variable.

data %>% 
  ggplot(aes(PROPERTY)) + geom_bar(aes(fill = PROPERTY)) + 
  scale_fill_discrete(name = "PROPERTY", labels = c("OTHER", "REAL_ESTATE", "PROP_UNKN_NONE")) + 
  geom_label(stat = 'count', aes(label = after_stat(count)))

data %>% 
  ggplot(aes(RESPONSE)) + geom_bar(aes(fill = PROPERTY), position = "dodge") + 
  scale_fill_discrete(name = "PROPERTY", labels = c("OTHER", "REAL_ESTATE", "PROP_UNKN_NONE"))

We can clearly see in the first graph that the majority of the observations fall in the other category, with a value equal to 0 (563 compared to the 282 of REAL_ESTATE and 154 of PROP_UNKN_NONE).

If we consider the response in the second graph, among the rejected credits we cannot really see a difference between applicants who own real estate and those with no known property, whereas applicants who own real estate appear more likely to get the credit than those who do not.

Residence

Now let’s look at the second variable that we are considering creating.

This variable will be created using the RENT and OWN_RES variables to describe whether a person has a residence or not.

Let’s start once again by the chi-squared independence test.

chisq.test(data$RENT, data$OWN_RES)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  data$RENT and data$OWN_RES
## X-squared = 536.77, df = 1, p-value < 2.2e-16

Here too, we can conclude that the two variables are significantly associated, as the p-value is far below the chosen significance level of alpha = 5%.

Hence, we will create the RESIDENCE variable, which takes value 1 if the person is renting, value 2 if the person owns their residence and value 0 otherwise.

data %<>% 
  mutate(RESIDENCE = as.factor(ifelse(RENT == 1, 1, 
                                      ifelse(OWN_RES == 1, 2, 0))))

And let’s explore the new variable a little bit, in terms of number of observations per level and if there is a difference in the possibility to get a credit given this information.

data %>% 
  ggplot(aes(RESIDENCE)) + geom_bar(aes(fill = RESIDENCE)) + scale_fill_discrete(name = "RESIDENCE", labels = c("OTHER", "RENT", "OWN_RES"))+ geom_label(stat = 'count', aes(label =..count..)) 

data %>% 
  ggplot(aes(RESPONSE)) + geom_bar(aes(fill = RESIDENCE), position = "dodge") + scale_fill_discrete(name = "RESIDENCE", labels = c("OTHER", "RENT", "OWN_RES"))

In the first graph, we can see that the majority of the people in the sample do own their own residence (712 observations, compared to the 108 of other and 179 who are renting).

Looking at the second graph, comparing it to the response variable, we can see that owning the residence seems to have an impact on the possibility to get the credit, while renting seems not to have a major impact.

Integrate data

Task Output
These are methods whereby information is combined from multiple tables or records to create new records or values. Merging tables refers to joining together two or more tables that have different information about the same objects.
Merged data also covers aggregations. Aggregation refers to operations where new values are computed by summarizing together information from multiple records and/or tables.

Selecting the variables we created and discarding others

Here we integrate the variables we created in the dataset and we discard the ones we used to create them, so that we avoid the problem of multicollinearity.

For the same reason, we also need to drop one of the variables we used to create the SEX_MALE variable. We choose to drop MALE_DIV.

We will also drop the identifier variable (OBS.) as it is not needed in the modelling part.

data_sel <- data %>%
                dplyr::select(CHK_ACCT, DURATION, HISTORY, PURPOSE,
                       AMOUNT, SAV_ACCT, EMPLOYMENT, INSTALL_RATE,
                       SEX_MALE, MALE_SINGLE, MALE_MAR_or_WID,
                       CO.APPLICANT, GUARANTOR, PRESENT_RESIDENT,
                       PROPERTY, AGE, OTHER_INSTALL, RESIDENCE,
                       NUM_CREDITS, JOB, TELEPHONE, RESPONSE) 
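The multicollinearity concern behind this selection can be illustrated with a small sketch. It uses simulated data (the dummies `d_a`, `d_b`, `d_c` and the response `y` are hypothetical) to show what happens when a full set of mutually exclusive dummies is kept together with an intercept.

```r
# Toy data with three mutually exclusive dummies (hypothetical values)
set.seed(1)
group <- sample(c("a", "b", "c"), size = 60, replace = TRUE)
toy <- data.frame(d_a = as.integer(group == "a"),
                  d_b = as.integer(group == "b"),
                  d_c = as.integer(group == "c"),
                  y   = rnorm(60))

# d_a + d_b + d_c equals 1 in every row, i.e. the intercept column,
# so one dummy is an exact linear combination of the others
fit_all <- lm(y ~ d_a + d_b + d_c, data = toy)
coef(fit_all)   # the redundant dummy gets an NA coefficient

# Dropping one dummy removes the exact linear dependence
fit_drop <- lm(y ~ d_a + d_b, data = toy)
coef(fit_drop)
```

This is why one of each group of derived or complementary dummies is left out of `data_sel`.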

Select data

Task Output
Decide on the data to be used for analysis. Criteria include relevance to the data mining goals, quality and technical constraints such as limits on data volume or data types. Note that data selection covers selection of attributes (columns) as well as selection of records (rows) in a table. List the data to be included/excluded and the reasons for these decisions.

To further select the data we will use the correlation and we will run a simple linear model to have a look at which are the most important variables to be selected.

We start with the correlation and we use the basic dataset, because we cannot run a correlation on factor variables.

##                            V1
## PRESENT_RESIDENT -0.003059919
## NUM_DEPENDENTS    0.003296525
## MALE_MAR_or_WID   0.019844152
## FURNITURE        -0.020669253
## JOB              -0.033889427
## TELEPHONE         0.035704280
## RETRAINING       -0.035923923
## NUM_CREDITS       0.046215841
## MALE_DIV         -0.049924304
## GUARANTOR         0.055206089
## CO.APPLICANT     -0.062607640
## EDUCATION        -0.069954175
## INSTALL_RATE     -0.073052339
## MALE_SINGLE       0.081465268
## AGE               0.089413005
## RENT             -0.092509400
## NEW_CAR          -0.098268291
## USED_CAR          0.100040026
## RADIO.TV          0.107374760
## OTHER_INSTALL    -0.113009082
## EMPLOYMENT        0.117550263
## REAL_ESTATE       0.119759431
## PROP_UNKN_NONE   -0.125508812
## OWN_RES           0.134228850
## AMOUNT           -0.154366015
## SAV_ACCT          0.178079352
## DURATION         -0.214326399
## HISTORY           0.229192869
## CHK_ACCT          0.352022485

We can see that, in general, the correlation between the output variable and the explanatory variables is not particularly high: in absolute value it ranges from about 0.003 (PRESENT_RESIDENT) to 0.35 (CHK_ACCT).
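The code that produced this listing is not shown in the report. A ranking of this kind could be obtained along the following lines; this is only a sketch on simulated data, and the column names `x1`, `x2`, `x3` as well as the object names are hypothetical.

```r
# Toy numeric data standing in for the original dataset (hypothetical values)
set.seed(42)
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
toy$RESPONSE <- as.integer(toy$x1 + rnorm(100) > 0)

# Correlation of each predictor with the response
predictors <- setdiff(names(toy), "RESPONSE")
cors <- sapply(toy[predictors], function(col) cor(col, toy$RESPONSE))

# Sort by absolute value, weakest first, as in the listing
cors_sorted <- cors[order(abs(cors))]
cors_sorted
```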

We could decide to select only the variables having a correlation higher than a certain absolute value, however, as the difference among the correlations is not really high, and as we have used the basic dataset and not the one with the variables we have just created, we prefer not to make a selection here, and rather leave this decision to the modelling of a simple linear regression and a choice made on the AIC.

The Akaike information criterion (AIC) is a mathematical method for evaluating how well a model fits the data it was generated from. In statistics, AIC is used to compare different possible models and determine which one is the best fit for the data. source: https://www.scribbr.com/statistics/akaike-information-criterion/

At each iteration, the step function discards the variable whose removal lowers the AIC of the model the most, and it stops when no further removal can decrease the AIC.
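As a minimal sketch of this backward elimination, the example below applies `step()` to the built-in `mtcars` data, which is used purely for illustration and has nothing to do with the credit data. The starting model contains all predictors and the selected model can only have an equal or lower AIC.

```r
# Built-in mtcars data used purely for illustration
full <- glm(mpg ~ ., data = mtcars)   # gaussian family by default

# Backward elimination: at each step, drop the term whose removal lowers the AIC most
reduced <- step(full, direction = "backward", trace = 0)

AIC(full)         # AIC of the model with all predictors
AIC(reduced)      # AIC after selection, never larger than the starting value
formula(reduced)  # the terms that survived the selection
```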

Perform significance and correlation tests to decide what to include

set.seed(2143)
lm.sel <- glm(RESPONSE ~., data = data_sel)
lm.sel <- step(lm.sel, trace = 0)
summary(lm.sel) 
## 
## Call:
## glm(formula = RESPONSE ~ CHK_ACCT + DURATION + HISTORY + PURPOSE + 
##     AMOUNT + SAV_ACCT + EMPLOYMENT + INSTALL_RATE + MALE_SINGLE + 
##     GUARANTOR + PROPERTY + OTHER_INSTALL + RESIDENCE + NUM_CREDITS + 
##     TELEPHONE, data = data_sel)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.05164  -0.31768   0.08993   0.28791   0.83553  
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.944e-01  1.056e-01   6.576 7.88e-11 ***
## CHK_ACCT       9.306e-02  1.078e-02   8.632  < 2e-16 ***
## DURATION      -5.070e-03  1.453e-03  -3.489 0.000506 ***
## HISTORY        6.796e-02  1.363e-02   4.984 7.35e-07 ***
## PURPOSE1      -1.213e-01  6.016e-02  -2.017 0.043987 *  
## PURPOSE2       1.018e-01  6.854e-02   1.485 0.137931    
## PURPOSE3      -1.316e-02  6.218e-02  -0.212 0.832417    
## PURPOSE4       2.282e-03  5.959e-02   0.038 0.969465    
## PURPOSE5      -1.558e-01  7.902e-02  -1.972 0.048920 *  
## PURPOSE6      -1.599e-02  6.837e-02  -0.234 0.815149    
## AMOUNT        -1.719e-05  6.730e-06  -2.554 0.010801 *  
## SAV_ACCT       3.303e-02  8.344e-03   3.959 8.08e-05 ***
## EMPLOYMENT     2.125e-02  1.106e-02   1.920 0.055087 .  
## INSTALL_RATE  -4.665e-02  1.279e-02  -3.648 0.000278 ***
## MALE_SINGLE    7.415e-02  2.757e-02   2.690 0.007267 ** 
## GUARANTOR      1.724e-01  5.859e-02   2.943 0.003330 ** 
## PROPERTY1      4.181e-02  3.058e-02   1.367 0.171816    
## PROPERTY2     -9.679e-02  5.763e-02  -1.680 0.093348 .  
## OTHER_INSTALL -8.740e-02  3.315e-02  -2.637 0.008506 ** 
## RESIDENCE1    -1.280e-01  6.941e-02  -1.844 0.065457 .  
## RESIDENCE2    -5.565e-02  6.651e-02  -0.837 0.402923    
## NUM_CREDITS   -4.231e-02  2.489e-02  -1.700 0.089473 .  
## TELEPHONE      4.960e-02  2.734e-02   1.814 0.069942 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1577717)
## 
##     Null deviance: 209.91  on 998  degrees of freedom
## Residual deviance: 153.99  on 976  degrees of freedom
## AIC: 1015
## 
## Number of Fisher Scoring iterations: 2
data_sel <- lm.sel$model

Thanks to the AIC, we select the following variables: CHK_ACCT, DURATION, HISTORY, PURPOSE, AMOUNT, SAV_ACCT, EMPLOYMENT, INSTALL_RATE, MALE_SINGLE, GUARANTOR, PROPERTY, OTHER_INSTALL, RESIDENCE, NUM_CREDITS and TELEPHONE, as removing any of them would no longer lower the AIC. It is interesting to note that some levels of PURPOSE seem to be less relevant: the only ones that are statistically significant are the first and the fifth. Moreover, the selection is coherent with the variables that had the highest correlations computed before, hence we will use this method to make our final selection on the data.

Format data

Task Output
Formatting transformations refer to primarily syntactic modifications made to the data that do not change its meaning, but might be required by the modeling tool. Some tools have requirements on the order of the attributes, such as the first field being a unique identifier for each record or the last field being the outcome field the model is to predict. It might be important to change the order of the records in the dataset. Perhaps the modeling tool requires that the records be sorted according to the value of the outcome attribute. Additionally, there are purely syntactic changes made to satisfy the requirements of the specific modeling tool.

We will change the variables to factors for the dummies and the categorical variables, to have them corresponding to the description that has been given to us.

data_sel %<>% 
  mutate(
    CHK_ACCT = as.factor(CHK_ACCT),
    HISTORY = as.factor(HISTORY),
    SAV_ACCT = as.factor(SAV_ACCT),
    EMPLOYMENT = as.factor(EMPLOYMENT),
    MALE_SINGLE = as.factor(MALE_SINGLE), 
    GUARANTOR = as.factor(GUARANTOR),
    OTHER_INSTALL = as.factor(OTHER_INSTALL),
    TELEPHONE = as.factor(TELEPHONE),
    RESPONSE = as.factor(RESPONSE)
  )


str(data_sel)
## 'data.frame':    999 obs. of  16 variables:
##  $ RESPONSE     : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 1 ...
##  $ CHK_ACCT     : Factor w/ 4 levels "0","1","2","3": 1 2 4 1 1 4 4 2 4 2 ...
##  $ DURATION     : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ HISTORY      : Factor w/ 5 levels "0","1","2","3",..: 5 3 5 3 4 3 3 3 3 5 ...
##  $ PURPOSE      : Factor w/ 7 levels "0","1","2","3",..: 5 5 6 4 2 6 4 3 5 2 ...
##  $ AMOUNT       : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ SAV_ACCT     : Factor w/ 5 levels "0","1","2","3",..: 5 1 1 1 1 5 3 1 4 1 ...
##  $ EMPLOYMENT   : Factor w/ 5 levels "0","1","2","3",..: 5 3 4 4 3 3 5 3 4 1 ...
##  $ INSTALL_RATE : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ MALE_SINGLE  : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 1 1 ...
##  $ GUARANTOR    : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 1 1 ...
##  $ PROPERTY     : Factor w/ 3 levels "0","1","2": 2 2 2 1 3 3 1 1 2 1 ...
##  $ OTHER_INSTALL: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ RESIDENCE    : Factor w/ 3 levels "0","1","2": 3 3 3 1 1 1 3 2 3 3 ...
##  $ NUM_CREDITS  : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ TELEPHONE    : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 2 1 1 ...
##  - attr(*, "terms")=Classes 'terms', 'formula'  language RESPONSE ~ CHK_ACCT + DURATION + HISTORY + PURPOSE + AMOUNT + SAV_ACCT +      EMPLOYMENT + INSTALL_RATE + MALE_SI| __truncated__ ...
##   .. ..- attr(*, "variables")= language list(RESPONSE, CHK_ACCT, DURATION, HISTORY, PURPOSE, AMOUNT, SAV_ACCT,      EMPLOYMENT, INSTALL_RATE, MALE_SINGLE| __truncated__ ...
##   .. ..- attr(*, "factors")= int [1:16, 1:15] 0 1 0 0 0 0 0 0 0 0 ...
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:16] "RESPONSE" "CHK_ACCT" "DURATION" "HISTORY" ...
##   .. .. .. ..$ : chr [1:15] "CHK_ACCT" "DURATION" "HISTORY" "PURPOSE" ...
##   .. ..- attr(*, "term.labels")= chr [1:15] "CHK_ACCT" "DURATION" "HISTORY" "PURPOSE" ...
##   .. ..- attr(*, "order")= int [1:15] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(RESPONSE, CHK_ACCT, DURATION, HISTORY, PURPOSE, AMOUNT, SAV_ACCT,      EMPLOYMENT, INSTALL_RATE, MALE_SINGLE| __truncated__ ...
##   .. ..- attr(*, "dataClasses")= Named chr [1:16] "numeric" "numeric" "numeric" "numeric" ...
##   .. .. ..- attr(*, "names")= chr [1:16] "RESPONSE" "CHK_ACCT" "DURATION" "HISTORY" ...

The selected dataset hence has 999 observations of 16 variables. The first variable is the output (i.e. RESPONSE), which is a dummy. The other 15 are the independent variables: 4 of them are numeric (i.e. DURATION, AMOUNT, INSTALL_RATE and NUM_CREDITS) and the remaining ones are all categorical or dummy variables.

We are now ready to move on with the modelling part of our analysis.

Model

Select modeling technique

The modelling techniques that we will be using are the following:

  1. Logistic regression
     > Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) is estimating the parameters of a logistic model (a form of binary regression).
     (https://en.wikipedia.org/wiki/Logistic_regression)
  2. Decision trees
     > A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.
     (https://en.wikipedia.org/wiki/Decision_tree)
  3. Discriminant analysis
     > Discriminant analysis is statistical technique used to classify observations into non-overlapping groups, based on scores on one or more quantitative predictor variables.
     (https://stattrek.com/multiple-regression/discriminant-analysis.aspx)
  4. Random forest
     > Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean/average prediction (regression) of the individual trees.
     (https://en.wikipedia.org/wiki/Random_forest)
  5. Neural network
     > A neural network is a network or circuit of neurons, or in a modern sense, an artificial neural network, composed of artificial neurons or nodes.
     (https://en.wikipedia.org/wiki/Neural_network)
  6. XGBoost
     > XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.
     (https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/)

In order to compare the 6 models listed above, we will mainly use the caret package for each algorithm.

Generate test design

\(H_{0}\): \(Model_n\) gives the best accuracy and sensitivity.

\(H_{1}\): \(Model_n\) does not give the best values.

Where \(n = 1, 2, ..., 6\) indexes the models listed in the modeling technique selection above.

Build model

To be able to build the models, we first need to standardize the data, as the variables have different scales. We will, however, standardize only the continuous variables, as the categorical and dummy variables have only a few levels.

Now that the standardization is done, let's move on to creating the training and test sets. The data are randomly divided into two subsets, with 75% of the observations in the training set and the remaining 25% in the test set.
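The preprocessing code is not displayed in the report, so the following is only a sketch of how such a step could look in base R. It assumes z-score scaling (mean 0, standard deviation 1) and a split that is stratified on the response; the toy data and the names `toy`, `train_toy` and `test_toy` are hypothetical.

```r
# Toy data standing in for data_sel (hypothetical values, 30/70 class split)
set.seed(1234)
toy <- data.frame(AMOUNT   = runif(200, 250, 18000),
                  DURATION = sample(4:72, 200, replace = TRUE),
                  RESPONSE = factor(rep(c("0", "1"), times = c(60, 140))))

# Standardize only the continuous columns (mean 0, standard deviation 1)
num_cols <- c("AMOUNT", "DURATION")
toy[num_cols] <- lapply(toy[num_cols], function(x) as.numeric(scale(x)))

# Stratified 75/25 split: sample 75% of the row indices within each class
idx_by_class <- split(seq_len(nrow(toy)), toy$RESPONSE)
train_idx <- unname(unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.75 * length(idx)))
})))
train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

# Class proportions are preserved in both subsets
prop.table(table(train_toy$RESPONSE))
prop.table(table(test_toy$RESPONSE))
```

Sampling within each class is what keeps the share of good and bad credits identical in the training and test sets.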

As you can see above, the class proportions are the same in the full dataset, the training set and the test set. Specifically, in all of them the dependent variable is imbalanced, since it shows a greater tendency for a positive response. For this reason we will evaluate two fits for each algorithm, one with the imbalanced data and the other with balanced data. Finally, to be able to compare them we will compute the confusion matrix, from which we derive the following measures:

  • \[ Accuracy = \frac{TruePositive+TrueNegative} {TruePositive+TrueNegative+FalsePositive+FalseNegative}\]
  • \[ Sensitivity = \frac{TruePositive} {TruePositive+FalseNegative}\]
  • \[ Specificity = \frac{TrueNegative} {TrueNegative+FalsePositive}\]

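In short, the sensitivity measures the true positive rate. With the caret defaults, the positive class is the first level of RESPONSE, i.e. 0 (bad credit), so the sensitivity is the share of bad credit risks that the model correctly identifies. This is key for this project, since accepting an applicant who is actually a bad risk (a wrongly predicted RESPONSE = 1, which we refer to as a false positive in the diagnoses below) has a negative impact on our main objective, as it would increase the risk of agreed payments not being refunded. This means that, in addition to balancing the data, we will focus the second model on maximizing sensitivity.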
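The three measures can be computed by hand from predicted and observed classes. The sketch below uses ten made-up observations and treats "0" (bad credit) as the positive class, which is what caret does by default when "0" is the first factor level; the vectors `actual` and `predicted` are hypothetical.

```r
# Hypothetical true and predicted classes; "0" = bad credit, "1" = good credit
actual    <- factor(c("0", "0", "0", "0", "1", "1", "1", "1", "1", "1"),
                    levels = c("0", "1"))
predicted <- factor(c("0", "0", "1", "1", "1", "1", "1", "1", "1", "0"),
                    levels = c("0", "1"))

# Positive class: the first factor level, here "0"
positive <- levels(actual)[1]

tp <- sum(predicted == positive & actual == positive)  # bad credits detected
fn <- sum(predicted != positive & actual == positive)  # bad credits missed
tn <- sum(predicted != positive & actual != positive)  # good credits accepted
fp <- sum(predicted == positive & actual != positive)  # good credits rejected

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
```

In this toy example two of the four bad credits are missed, so the sensitivity is only 0.5 even though the accuracy is 0.7.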

M1: Logistic regression

The general equation for the model is:

\[ Z_{i} = ln(\frac{P_{i}} {1-P_{i}}) = \beta_0+\beta_1X_1+...+\beta_nX_n \]

For the application of the algorithm we will apply the following steps:

Unbalanced data: As we have seen earlier, the output variable is unbalanced. We are going to evaluate the accuracy and the sensitivity of the model, with the following steps:
  1) Fit the model.
  2) Analyse the coefficients.
  3) Predict.
  4) Compute the confusion matrix.

Balanced data: In this step, we are going to balance the data through the sampling option of the trainControl function and then evaluate the accuracy and the sensitivity of the model, with the following steps:
  5) Fit the model.
  6) Predict.
  7) Compute the confusion matrix.
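The balancing used in the second set of steps is down-sampling: the majority class is randomly reduced to the size of the minority class. caret applies this internally when `sampling = "down"` is set; the sketch below only illustrates the idea in base R on hypothetical toy data and is not the code caret runs.

```r
# Toy imbalanced data (hypothetical): 30 rows of class "0" and 70 of class "1"
set.seed(1234)
toy <- data.frame(x = rnorm(100),
                  RESPONSE = factor(rep(c("0", "1"), times = c(30, 70))))

# Size of the smallest class
n_min <- min(table(toy$RESPONSE))

# Keep n_min randomly chosen rows from every class
idx_by_class <- split(seq_len(nrow(toy)), toy$RESPONSE)
keep <- unname(unlist(lapply(idx_by_class, function(idx) sample(idx, size = n_min))))
toy_down <- toy[keep, ]

table(toy_down$RESPONSE)   # both classes now have the same number of rows
```

The price of this balance is that part of the majority class is discarded, so the model is fitted on fewer observations.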

Unbalanced data

1) Fitting the model
#Same division
set.seed(1234)

#########################model######################################

train_params <- caret::trainControl(method = "repeatedcv", number = 10, repeats=5) 
#10-Fold Cross Validation   #5 repetitions

mod_lg_fit <- caret::train(RESPONSE ~ ., TrainData, method="glm", 
                           family="binomial",trControl= train_params)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7192  -0.7199   0.3843   0.7077   2.3350  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -0.93510    0.81482  -1.148  0.25113    
## CHK_ACCT1       0.25725    0.24875   1.034  0.30106    
## CHK_ACCT2       1.15025    0.44137   2.606  0.00916 ** 
## CHK_ACCT3       1.65679    0.25994   6.374 1.84e-10 ***
## DURATION       -0.33249    0.12474  -2.665  0.00769 ** 
## HISTORY1       -0.35513    0.60991  -0.582  0.56039    
## HISTORY2        0.51189    0.48162   1.063  0.28784    
## HISTORY3        0.71627    0.52256   1.371  0.17047    
## HISTORY4        1.41024    0.49046   2.875  0.00404 ** 
## PURPOSE1       -0.92834    0.42913  -2.163  0.03052 *  
## PURPOSE2        0.83586    0.54279   1.540  0.12358    
## PURPOSE3       -0.08614    0.44419  -0.194  0.84624    
## PURPOSE4        0.05502    0.43423   0.127  0.89917    
## PURPOSE5       -0.73150    0.57782  -1.266  0.20553    
## PURPOSE6       -0.08247    0.49176  -0.168  0.86682    
## AMOUNT         -0.32341    0.14085  -2.296  0.02167 *  
## SAV_ACCT1       0.45888    0.33152   1.384  0.16630    
## SAV_ACCT2       0.24327    0.45461   0.535  0.59257    
## SAV_ACCT3       0.70072    0.55496   1.263  0.20671    
## SAV_ACCT4       1.31133    0.31892   4.112 3.93e-05 ***
## EMPLOYMENT1     0.25345    0.41874   0.605  0.54499    
## EMPLOYMENT2     0.63919    0.39563   1.616  0.10617    
## EMPLOYMENT3     1.11825    0.43882   2.548  0.01082 *  
## EMPLOYMENT4     0.72410    0.41270   1.755  0.07933 .  
## INSTALL_RATE   -0.35899    0.11182  -3.210  0.00133 ** 
## MALE_SINGLE1    0.54316    0.20976   2.590  0.00961 ** 
## GUARANTOR1      0.89175    0.48361   1.844  0.06519 .  
## PROPERTY1       0.06645    0.23901   0.278  0.78099    
## PROPERTY2      -0.53913    0.41424  -1.301  0.19309    
## OTHER_INSTALL1 -0.55776    0.24600  -2.267  0.02337 *  
## RESIDENCE1     -0.77956    0.50088  -1.556  0.11962    
## RESIDENCE2     -0.27558    0.47140  -0.585  0.55883    
## NUM_CREDITS    -0.11808    0.12232  -0.965  0.33440    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 916.30  on 749  degrees of freedom
## Residual deviance: 682.31  on 717  degrees of freedom
## AIC: 748.31
## 
## Number of Fisher Scoring iterations: 5

In this step, we can see that the terms that are statistically significant for the model (at the 5% level) are: the second and third levels of CHK_ACCT, DURATION, the fourth level of HISTORY, the first level of PURPOSE, AMOUNT, the fourth level of SAV_ACCT, the third level of EMPLOYMENT, INSTALL_RATE, MALE_SINGLE and OTHER_INSTALL.

2) Coefficients
Table 1: Significance of the variables
Variable Significant (p-value < 0.05)
CHK_ACCT1 FALSE
CHK_ACCT2 TRUE
CHK_ACCT3 TRUE
DURATION TRUE
HISTORY1 FALSE
HISTORY2 FALSE
HISTORY3 FALSE
HISTORY4 TRUE
PURPOSE1 TRUE
PURPOSE2 FALSE
PURPOSE3 FALSE
PURPOSE4 FALSE
PURPOSE5 FALSE
PURPOSE6 FALSE
AMOUNT TRUE
SAV_ACCT1 FALSE
SAV_ACCT2 FALSE
SAV_ACCT3 FALSE
SAV_ACCT4 TRUE
EMPLOYMENT1 FALSE
EMPLOYMENT2 FALSE
EMPLOYMENT3 TRUE
EMPLOYMENT4 FALSE
INSTALL_RATE TRUE
MALE_SINGLE1 TRUE
GUARANTOR1 FALSE
PROPERTY1 FALSE
PROPERTY2 FALSE
OTHER_INSTALL1 TRUE
RESIDENCE1 FALSE
RESIDENCE2 FALSE
NUM_CREDITS FALSE

If we look at the coefficients of the different variables we can conclude that, among the significant ones described before, CHK_ACCT, HISTORY (all but the first level), SAV_ACCT, EMPLOYMENT and MALE_SINGLE have a positive impact on the output, meaning that the higher their level, or if the dummy is equal to 1, the higher the probability of having RESPONSE = 1.

On the other hand, among the significant variables, DURATION, PURPOSE (all but levels two and four), AMOUNT, INSTALL_RATE and OTHER_INSTALL have a negative effect on the output, meaning that if their level or value increases, or if the dummy is equal to 1, the probability of having a positive response decreases.

The linear predictor, with the coefficients of the summary above rounded, is given by \[ \begin{aligned} \eta ={} & - 0.9 + 0.3 * CHKACCT_1 + 1.2 * CHKACCT_2 + 1.7 * CHKACCT_3 - 0.3 * DURATION \\ & - 0.4 * HISTORY_1 + 0.5 * HISTORY_2 + 0.7 * HISTORY_3 + 1.4 * HISTORY_4 \\ & - 0.9 * PURPOSE_1 + 0.8 * PURPOSE_2 - 0.09 * PURPOSE_3 + 0.06 * PURPOSE_4 - 0.7 * PURPOSE_5 - 0.08 * PURPOSE_6 \\ & - 0.3 * AMOUNT + 0.5 * SAVACCT_1 + 0.2 * SAVACCT_2 + 0.7 * SAVACCT_3 + 1.3 * SAVACCT_4 \\ & + 0.25 * EMPLOYMENT_1 + 0.6 * EMPLOYMENT_2 + 1.1 * EMPLOYMENT_3 + 0.7 * EMPLOYMENT_4 \\ & - 0.4 * INSTALLRATE + 0.5 * MALESINGLE_1 + 0.9 * GUARANTOR_1 + 0.07 * PROPERTY_1 - 0.5 * PROPERTY_2 \\ & - 0.6 * OTHERINSTALL_1 - 0.8 * RESIDENCE_1 - 0.3 * RESIDENCE_2 - 0.1 * NUMCREDITS \end{aligned} \]

To be clear, if for example the PURPOSE variable takes value 3, only the coefficient of PURPOSE_3 is added to the others. The same goes for every other categorical variable. For the dummies, the coefficient is added only if the variable is equal to 1. For the continuous variables, the coefficient is multiplied by the value recorded in the observation.

3) Prediction

Now we will get the predictions using this model. To do so, we start by computing the probability of the output given the coefficients found by fitting the model, then we use a cut point of 0.5 to decide whether the predicted value is 1 (if the probability is higher than 0.5) or 0 (otherwise). The model plugs the information of the new observation into the function given above to obtain a value eta, which is then turned into the predicted probability of the output through p = 1 / (1 + exp(-eta)); this probability determines the class predicted for the outcome.
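As a worked illustration of this mapping, the sketch below inverts the logit of the general model equation for a hypothetical applicant. Only three coefficients, rounded from the summary above, are used; every other term is left out, so the resulting probability is not a real prediction of the fitted model.

```r
# Three coefficients rounded from the summary, for illustration only
beta <- c(intercept = -0.94, CHK_ACCT3 = 1.66, DURATION = -0.33)

# Hypothetical applicant: CHK_ACCT level 3 and a standardized DURATION of 0.5;
# all other terms of the model are ignored in this sketch
x <- c(intercept = 1, CHK_ACCT3 = 1, DURATION = 0.5)

eta <- sum(beta * x)         # linear predictor
p   <- 1 / (1 + exp(-eta))   # inverse logit, identical to plogis(eta)

# Classify with the 0.5 cut point
pred_class <- ifelse(p > 0.5, 1, 0)
```

Since eta is positive here, p is above 0.5 and the applicant is assigned to class 1.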

#prediction given the model
lg.pred <- predict(mod_lg_fit, newdata = TestData)  

The imbalance towards positive predictions is more than clear in this graph.

4) Diagnosis

The sensitivity is quite low, at 52%. The specificity, though, is high, at almost 90%, while the accuracy is 78%. The number of false positives is quite high, at 36.

Balanced data

5) Fitting the model
#Same division
set.seed(1234)

#########################model######################################
train_params <- caret::trainControl(method = "repeatedcv", number = 10, 
                                    repeats=5, sampling = "down")

mod_lg_fitbalance <- caret::train(RESPONSE ~ ., TrainData, method="glm", 
                                  family="binomial", 
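                                  # note: caret only computes "Sens" when trainControl() sets
                                  # summaryFunction = twoClassSummary and classProbs = TRUE;
                                  # otherwise it warns and falls back to Accuracy
                                  metric = "Sens", #optimize sensitivity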
                                  maximize = TRUE, #maximize the metric
                                  trControl= train_params)

################check outputs################################
summary(mod_lg_fitbalance)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.22739  -0.78261  -0.02752   0.79604   2.53360  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -0.8770     1.0650  -0.823 0.410229    
## CHK_ACCT1        0.3269     0.3181   1.028 0.304055    
## CHK_ACCT2        1.3096     0.5156   2.540 0.011079 *  
## CHK_ACCT3        1.9754     0.3241   6.096 1.09e-09 ***
## DURATION        -0.2847     0.1556  -1.830 0.067224 .  
## HISTORY1        -0.5277     0.7045  -0.749 0.453894    
## HISTORY2        -0.3269     0.5422  -0.603 0.546492    
## HISTORY3        -0.1523     0.5993  -0.254 0.799437    
## HISTORY4         0.8379     0.5514   1.519 0.128658    
## PURPOSE1        -1.4030     0.5170  -2.714 0.006656 ** 
## PURPOSE2         0.4736     0.6667   0.710 0.477406    
## PURPOSE3        -0.6911     0.5304  -1.303 0.192610    
## PURPOSE4        -0.6082     0.5162  -1.178 0.238638    
## PURPOSE5        -0.7442     0.6780  -1.098 0.272320    
## PURPOSE6        -0.7298     0.5839  -1.250 0.211399    
## AMOUNT          -0.2435     0.1755  -1.387 0.165326    
## SAV_ACCT1        0.5952     0.4076   1.460 0.144271    
## SAV_ACCT2       -0.2930     0.5523  -0.530 0.595804    
## SAV_ACCT3        0.4723     0.6553   0.721 0.471134    
## SAV_ACCT4        1.2719     0.3723   3.417 0.000634 ***
## EMPLOYMENT1      0.8801     0.5906   1.490 0.136204    
## EMPLOYMENT2      1.2155     0.5581   2.178 0.029427 *  
## EMPLOYMENT3      1.7196     0.6056   2.839 0.004519 ** 
## EMPLOYMENT4      1.0968     0.5823   1.883 0.059652 .  
## INSTALL_RATE    -0.3919     0.1387  -2.825 0.004730 ** 
## MALE_SINGLE1     0.7278     0.2648   2.749 0.005983 ** 
## GUARANTOR1       0.4654     0.6556   0.710 0.477814    
## PROPERTY1        0.2035     0.2913   0.699 0.484762    
## PROPERTY2       -1.0469     0.5950  -1.760 0.078487 .  
## OTHER_INSTALL1  -0.7800     0.3200  -2.438 0.014775 *  
## RESIDENCE1      -0.9846     0.6864  -1.435 0.151419    
## RESIDENCE2      -0.7140     0.6671  -1.070 0.284474    
## NUM_CREDITS     -0.1793     0.1529  -1.173 0.240747    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 623.83  on 449  degrees of freedom
## Residual deviance: 449.68  on 417  degrees of freedom
## AIC: 515.68
## 
## Number of Fisher Scoring iterations: 5
6) Prediction
#probability given the model
lg.pred.b <- predict(mod_lg_fitbalance, newdata = TestData)  

Contrary to the output of the first model, we can see that the proportion of the prediction is better in the balanced case.

7) Diagnosis

The sensitivity is 72%, the specificity is 64% and the accuracy is only 66%. The number of false positives is equal to 21.

M2: Decision trees

The next image illustrates how decision trees work.

Figure 4: A caption


The process consists in minimizing the classification error rate:

\[E = 1 - \max_{k}(p_{mk})\] where \(p_{mk}\) is the proportion of training observations in the \(m\)-th region (node) that belong to the \(k\)-th class.
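For a single node, the classification error rate is simply the share of observations that do not belong to the majority class. A minimal sketch with hypothetical class counts:

```r
# Hypothetical class counts in one node (region m) of a tree
node_counts <- c(bad = 12, good = 28)

# p_mk: proportion of the node's training observations in each class k
p_mk <- node_counts / sum(node_counts)

# Classification error rate: share of observations outside the majority class
E <- 1 - max(p_mk)

# The node predicts its most frequent class
predicted_class <- names(which.max(p_mk))
```

Here the node predicts "good" and misclassifies the 12 "bad" observations, which gives an error rate of 0.3.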

For the application of the algorithm we will apply the following steps:

Unbalanced data: As we have seen earlier, the output variable is unbalanced. We are going to evaluate the accuracy and the sensitivity of the model, with the following steps:
  1) Fit the model.
  2) Plot the best tree.
  3) Predict.
  4) Compute the confusion matrix.

Balanced data: In this step, we are going to balance the data through the sampling option of the trainControl function and then evaluate the accuracy and the sensitivity of the model, with the following steps:
  5) Fit the model.
  6) Plot the best tree.
  7) Predict.
  8) Compute the confusion matrix.

Unbalanced Data

1) Fitting the model

We will start by fitting the model on the data.

#Same division
set.seed(1234)

#########################model######################################
train_params <- trainControl(method = "repeatedcv", number = 10, repeats=5) 

#10-Fold Cross Validation #5 repetitions

mod_dt_fit <- caret::train(RESPONSE ~ ., TrainData, method="rpart", 
                           trControl= train_params)
##         var          n               wt             dev             yval    
##  <leaf>   :8   Min.   : 14.0   Min.   : 14.0   Min.   :  3.0   Min.   :1.0  
##  AMOUNT   :2   1st Qu.: 34.0   1st Qu.: 34.0   1st Qu.:  9.0   1st Qu.:1.0  
##  DURATION :2   Median :126.0   Median :126.0   Median : 42.0   Median :2.0  
##  CHK_ACCT3:1   Mean   :185.1   Mean   :185.1   Mean   : 62.8   Mean   :1.6  
##  PROPERTY2:1   3rd Qu.:252.0   3rd Qu.:252.0   3rd Qu.: 84.0   3rd Qu.:2.0  
##  SAV_ACCT4:1   Max.   :750.0   Max.   :750.0   Max.   :225.0   Max.   :2.0  
##  (Other)  :0                                                                
##    complexity          ncompete       nsurrogate 
##  Min.   :0.000000   Min.   :0.000   Min.   :0.0  
##  1st Qu.:0.002222   1st Qu.:0.000   1st Qu.:0.0  
##  Median :0.019259   Median :0.000   Median :0.0  
##  Mean   :0.015605   Mean   :1.867   Mean   :0.8  
##  3rd Qu.:0.026667   3rd Qu.:4.000   3rd Qu.:0.5  
##  Max.   :0.026667   Max.   :4.000   Max.   :5.0  
##                                                  
##       yval2.V1             yval2.V2             yval2.V3             yval2.V4             yval2.V5          yval2.nodeprob   
##  Min.   :1.0          Min.   :  3.00000    Min.   :  4.0        Min.   :0.1174497    Min.   :0.1428571    Min.   :0.0186667  
##  1st Qu.:1.0          1st Qu.: 26.50000    1st Qu.: 15.5        1st Qu.:0.3346560    1st Qu.:0.3971861    1st Qu.:0.0453333  
##  Median :2.0          Median : 42.00000    Median : 60.0        Median :0.4285714    Median :0.5714286    Median :0.1680000  
##  Mean   :1.6          Mean   : 67.13333    Mean   :118.0        Mean   :0.4620491    Mean   :0.5379509    Mean   :0.2468444  
##  3rd Qu.:2.0          3rd Qu.: 84.50000    3rd Qu.:167.5        3rd Qu.:0.6028139    3rd Qu.:0.6653440    3rd Qu.:0.3360000  
##  Max.   :2.0          Max.   :225.00000    Max.   :525.0        Max.   :0.8571429    Max.   :0.8825503    Max.   :1.0000000  
## 
2) Plot

3) Prediction
#prediction given the model
dt.pred <- predict(mod_dt_fit, newdata = TestData)  #predict() on a caret model returns the predicted class by default (type = "raw")

The prediction is clearly biased to a positive answer.

4) Diagnosis

We can see that here the sensitivity, which is the metric we are most interested in, is really low, while the specificity is higher, reaching a value above 92%. The accuracy is around 74%. Here, in 52 cases in which the model should have given a negative value, it actually predicted a positive one, which could cost the company quite a lot.

Balanced data

5) Fitting the model
#Same division
set.seed(1234)

#########################model######################################

train_params <- caret::trainControl(method = "repeatedcv", number = 10, 
                                    repeats=5, sampling = "down")

mod_dt_fitbalance <- caret::train(RESPONSE ~ ., TrainData, method="rpart", 
                                  metric = "Sens", #optimize sensitivity
                                  maximize = TRUE, #maximize the metric
                                  trControl= train_params)
##         var          n               wt             dev              yval      
##  <leaf>   :6   Min.   :  8.0   Min.   :  8.0   Min.   :  1.00   Min.   :1.000  
##  AMOUNT   :3   1st Qu.: 52.0   1st Qu.: 52.0   1st Qu.: 12.00   1st Qu.:1.000  
##  CHK_ACCT3:1   Median :174.0   Median :174.0   Median : 58.00   Median :1.000  
##  SAV_ACCT4:1   Mean   :166.8   Mean   :166.8   Mean   : 61.55   Mean   :1.364  
##  CHK_ACCT1:0   3rd Qu.:227.5   3rd Qu.:227.5   3rd Qu.: 80.00   3rd Qu.:2.000  
##  CHK_ACCT2:0   Max.   :450.0   Max.   :450.0   Max.   :225.00   Max.   :2.000  
##  (Other)  :0                                                                   
##    complexity          ncompete       nsurrogate    
##  Min.   :0.000000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.005556   1st Qu.:0.000   1st Qu.:0.0000  
##  Median :0.013333   Median :0.000   Median :0.0000  
##  Mean   :0.045657   Mean   :1.818   Mean   :0.5455  
##  3rd Qu.:0.022222   3rd Qu.:4.000   3rd Qu.:0.5000  
##  Max.   :0.364444   Max.   :4.000   Max.   :3.0000  
##                                                     
##       yval2.V1             yval2.V2             yval2.V3             yval2.V4             yval2.V5          yval2.nodeprob   
##  Min.   :1.0000000    Min.   :  1.00000    Min.   :  6.00000    Min.   :0.0833333    Min.   :0.1492537    Min.   :0.0177778  
##  1st Qu.:1.0000000    1st Qu.: 24.50000    1st Qu.: 17.00000    1st Qu.:0.3141892    1st Qu.:0.3424908    1st Qu.:0.1155556  
##  Median :1.0000000    Median :116.00000    Median : 64.00000    Median :0.6134021    Median :0.3865979    Median :0.3866667  
##  Mean   :1.3636364    Mean   : 95.72727    Mean   : 71.09091    Mean   :0.5030050    Mean   :0.4969950    Mean   :0.3707071  
##  3rd Qu.:2.0000000    3rd Qu.:147.50000    3rd Qu.: 96.50000    3rd Qu.:0.6575092    3rd Qu.:0.6858108    3rd Qu.:0.5055556  
##  Max.   :2.0000000    Max.   :225.00000    Max.   :225.00000    Max.   :0.8507463    Max.   :0.9166667    Max.   :1.0000000  
## 
6) Plot

7) Prediction

We can see that the predictions are a bit more balanced, even though positive predictions are still more frequent.
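The balancing requested with sampling = "down" can be pictured with a small base R sketch: the majority class is randomly reduced to the size of the minority class. The toy data frame and its 30/70 split are assumptions for illustration. As far as we understand, caret applies this step inside each resample, on the training folds only, which this sketch does not reproduce.

```r
set.seed(1234)

# Toy data: 30 observations of class "0" and 70 of class "1" (hypothetical)
toy <- data.frame(id = 1:100,
                  RESPONSE = factor(c(rep("0", 30), rep("1", 70))))

# Size of the smallest class
n_min <- min(table(toy$RESPONSE))

# For each class, keep n_min randomly chosen row indices
rows_by_class <- split(seq_len(nrow(toy)), toy$RESPONSE)
keep <- unlist(lapply(rows_by_class, function(i) i[sample.int(length(i), n_min)]))

toy_down <- toy[keep, ]
print(table(toy_down$RESPONSE))
```

Down-sampling discards observations from the majority class, so the balanced models are trained on less data than the unbalanced ones.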

8) Diagnosis

Here the sensitivity has improved. However, the specificity, which is the measure we are most interested in, is lower, at just above 63%. The accuracy is around 60%.

M3: Discriminant analysis

There are four types of discriminant analysis, which we explain in the following table:

| Model  | Definition |
|--------|------------|
| 1. LDA | Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher’s linear discriminant, a method used in statistics and other fields to find a linear combination of features that characterizes or separates two or more classes of objects or events. (https://en.wikipedia.org/wiki/Linear_discriminant_analysis) |
| 2. QDA | Quadratic discriminant analysis assumes that the measurements from each class are normally distributed, but it does not assume that the covariance of each class is identical. When the normality assumption is true, the best possible test for the hypothesis that a given measurement is from a given class is the likelihood ratio test. (https://en.wikipedia.org/wiki/Quadratic_classifier) |
| 3. FDA | Flexible discriminant analysis, the method fitted by caret with method = "fda". It extends LDA by replacing the underlying linear regression with a flexible, nonparametric regression (here multivariate adaptive regression splines from the earth package), so that the class boundaries can be nonlinear. |
| 4. MDA | Mixture discriminant analysis, the method fitted by caret with method = "mda". It models each class as a mixture of Gaussian subclasses rather than a single Gaussian, which allows more complex class shapes. The number of subclasses is a tuning parameter. |

The next image better illustrates how each model works.

Linear Discriminant Analysis

Steps for the application of the algorithm:

| Data set        | Steps               |
|-----------------|---------------------|
| Unbalanced data | 1) Fit the model    |
|                 | 2) Predict          |
|                 | 3) Confusion matrix |
| Balanced data   | 4) Fit the model    |
|                 | 5) Predict          |
|                 | 6) Confusion matrix |

Unbalanced

1) Fitting the model
#Same division
set.seed(1234)

#########################model######################################
train_params <- caret::trainControl(method = "repeatedcv", number = 10, repeats=5) 
#K-Fold Cross Validation
mod_lda_fit <- caret::train(RESPONSE ~ ., TrainData, method="lda", 
                           family="binomial",trControl= train_params)
##             Length Class      Mode     
## prior        2     -none-     numeric  
## counts       2     -none-     numeric  
## means       64     -none-     numeric  
## scaling     32     -none-     numeric  
## lev          2     -none-     character
## svd          1     -none-     numeric  
## N            1     -none-     numeric  
## call         4     -none-     call     
## xNames      32     -none-     character
## problemType  1     -none-     character
## tuneValue    1     data.frame list     
## obsLevels    2     -none-     character
## param        1     -none-     list

The linear combination of predictor variables that are used to form the decision rule is the following:

\[ RESPONSE = -0.3265 * DURATION -0.2747 * HISTORY_1 + 0.8792 * HISTORY_2 + 1.1810 * HISTORY_3 + 1.6214 * HISTORY_4 - 0.7437 * PURPOSE_1 + 0.7736 * PURPOSE_2 + 0.0172 * PURPOSE_3 + 0.2035 * PURPOSE_4 - 0.6116 * PURPOSE_5 + 0.1298 * PURPOSE_6 - 0.2579 * AMOUNT + 0.5066 * SAV_ACCT_1 + 0.7517 * SAV_ACCT_2 + 0.7778 * SAV_ACCT_3 + 1.0997 * SAV_ACCT_4 + 0.6175 * EMPLOYMENT_1 + 1.1982 * EMPLOYMENT_2 + 1.4580 * EMPLOYMENT_3 + 1.1806 * EMPLOYMENT_4 - 0.2897 * INSTALL_RATE - 0.5663 * SEX_MALE_1 + 0.9272 * MALE_SINGLE_1 + 0.5183 * MALE_MAR_WID_1 - 0.0863 * CO_APPLICANT_1 + 0.5084 * GUARANTOR_1 - 0.2437 * PRESENT_RESIDENT_-1 - 0.1913 * PRESENT_RESIDENT_0 - 0.0477 * PRESENT_RESIDENT_1 + 0.0367 * PROPERTY_1 - 0.6374 * PROPERTY_2 + 0.0502 * AGE - 0.4501 * OTHER_INSTALL_1 - 0.7509 * RESIDENCE_1 - 0.2185 * RESIDENCE_2 - 0.0703 * NUM_CREDITS - 1.0570 * JOB_1 - 1.0581 * JOB_2 - 0.8362 * JOB_3\]

Each new observation is scored by plugging its predictor values into this formula. The principle is the same as the one described for the generalized linear model.
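How a linear rule of this kind classifies an observation can be sketched on simulated data. This is a hand-rolled two-class LDA with two made-up predictors, a pooled covariance matrix and equal priors. It is not the caret fit above, and the names x0, x1, w and cutoff are illustrative only.

```r
set.seed(1)
n <- 100

# Two simulated classes with the same spread but different means
x0 <- matrix(rnorm(2 * n, mean = 0), ncol = 2)  # class 0
x1 <- matrix(rnorm(2 * n, mean = 3), ncol = 2)  # class 1

mu0 <- colMeans(x0)
mu1 <- colMeans(x1)

# Pooled within-class covariance (the LDA assumption: one covariance for all classes)
S <- ((n - 1) * cov(x0) + (n - 1) * cov(x1)) / (2 * n - 2)

# Coefficients of the linear combination and the cut-off for equal priors
w <- solve(S, mu1 - mu0)
cutoff <- sum(w * (mu0 + mu1) / 2)

# Score every observation and classify by comparing the score with the cut-off
X <- rbind(x0, x1)
y <- rep(c(0, 1), each = n)
score <- as.vector(X %*% w)
pred <- as.integer(score > cutoff)

acc <- mean(pred == y)
print(acc)
```

The vector w plays the role of the coefficients in the formula above: an observation is assigned to a class according to which side of the cut-off its score falls on.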

2) Prediction (unbalance)
lda.pred <- predict(mod_lda_fit, newdata = TestData)  # default type = "raw": returns the predicted class labels, not probabilities

This graph confirms what was explained in the fitting part: the predictions tend toward a positive response.

3) Diagnosis (Unbalance)

Here the sensitivity is higher than in the previous unbalanced models (above 40%), but it is still quite low. The accuracy is only around 78%. Importantly, in 36 cases the model predicted a positive value when the true value was negative, which could cost the company quite a lot.

Balanced data

4) Fitting the model: balance
#########################model######################################
train_params <- caret::trainControl(method = "repeatedcv", number = 10, 
                                    repeats=5, sampling = "down") 


mod_lda_fitbalance <- caret::train(RESPONSE ~ ., TrainData, method="lda", 
                           family="binomial",
                           metric = "Sens", #optimize sensitivity
                           maximize = TRUE, #maximize the metric
                           trControl= train_params)
##             Length Class      Mode     
## prior        2     -none-     numeric  
## counts       2     -none-     numeric  
## means       64     -none-     numeric  
## scaling     32     -none-     numeric  
## lev          2     -none-     character
## svd          1     -none-     numeric  
## N            1     -none-     numeric  
## call         4     -none-     call     
## xNames      32     -none-     character
## problemType  1     -none-     character
## tuneValue    1     data.frame list     
## obsLevels    2     -none-     character
## param        1     -none-     list
5) Prediction
lda.pred.b <- predict(mod_lda_fitbalance, newdata = TestData)  # default type = "raw": returns the predicted class labels, not probabilities

We can see that the situation is more balanced.

6) Diagnosis (balance)

The sensitivity is 73%, the specificity is only 65% and the accuracy is also low, at 67%. The number of false positives, though, is only 20.

Quadratic discriminant analysis

Steps for the application of the algorithm:

| Data set        | Steps               |
|-----------------|---------------------|
| Unbalanced data | 1) Fit the model    |
|                 | 2) Predict          |
|                 | 3) Confusion matrix |
| Balanced data   | 4) Fit the model    |
|                 | 5) Predict          |
|                 | 6) Confusion matrix |

Unbalanced data

1) Fitting the model
#Same division
set.seed(1234)

#########################model######################################
# K-fold cross-validation (10 folds, repeated 5 times)
train_params <- caret::trainControl(method = "repeatedcv", number = 10, repeats=5)
mod_qda_fit <- caret::train(RESPONSE ~ ., TrainData, method="qda", 
                           family="binomial",trControl= train_params)
##             Length Class      Mode     
## prior          2   -none-     numeric  
## counts         2   -none-     numeric  
## means         64   -none-     numeric  
## scaling     2048   -none-     numeric  
## ldet           2   -none-     numeric  
## lev            2   -none-     character
## N              1   -none-     numeric  
## call           4   -none-     call     
## xNames        32   -none-     character
## problemType    1   -none-     character
## tuneValue      1   data.frame list     
## obsLevels      2   -none-     character
## param          1   -none-     list
2) Prediction (unbalance)

Here, the majority of the wrong predictions still seem to fall in the positive class, although there appear to be fewer of them than before.

3) Diagnosis (Unbalance)

The sensitivity is 54%, the specificity is high, reaching almost 84%, and the accuracy is 75%. The number of false positives, however, is 34.

As we predicted, the model performs slightly worse than the LDA, except for the sensitivity, which is the highest so far (over 50%) but still quite low. The specificity is moderately high (above 80%), as is the accuracy (above 70%). Since we want the number of false positives to be low, 34 is still quite high.
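To make the difference with LDA concrete, the following base R sketch computes a quadratic discriminant score by hand: each class keeps its own mean and its own covariance matrix, and an observation is assigned to the class with the larger score. The simulated data, the equal priors and the helper qda_score() are assumptions for illustration and are not part of the caret fit above.

```r
set.seed(1)
n <- 200

# Two simulated classes with clearly different spreads
x0 <- matrix(rnorm(2 * n, mean = 0, sd = 1), ncol = 2)  # class 0: tight cloud
x1 <- matrix(rnorm(2 * n, mean = 2, sd = 3), ncol = 2)  # class 1: wide cloud

# Quadratic discriminant score for one class (hypothetical helper):
# -0.5 * log|S| - 0.5 * squared Mahalanobis distance + log(prior)
qda_score <- function(x, mu, S, prior) {
  log_det <- as.numeric(determinant(S, logarithm = TRUE)$modulus)
  -0.5 * log_det - 0.5 * mahalanobis(x, center = mu, cov = S) + log(prior)
}

X <- rbind(x0, x1)
y <- rep(c(0, 1), each = n)

# Class-specific means and covariances (this is what distinguishes QDA from LDA)
s0 <- qda_score(X, colMeans(x0), cov(x0), prior = 0.5)
s1 <- qda_score(X, colMeans(x1), cov(x1), prior = 0.5)

pred <- as.integer(s1 > s0)
acc <- mean(pred == y)
print(acc)
```

Because every class has its own covariance matrix, QDA estimates many more parameters than LDA, which is consistent with the much larger scaling component (2048 values against 32) in the model summaries above.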

Balanced data

4) Fitting the model: balance
5) Prediction
qda.pred.b <- predict(mod_qda_fitbalance, newdata = TestData)  

We can see that there is still some imbalance towards the positive value.

6) Diagnosis (balance)

The sensitivity is 65%, the specificity is 71% and the accuracy is almost 70%, while there are 26 false positives.

Flexible discriminant analysis (FDA)

Steps for the application of the algorithm:

| Data set        | Steps               |
|-----------------|---------------------|
| Unbalanced data | 1) Fit the model    |
|                 | 2) Predict          |
|                 | 3) Confusion matrix |
| Balanced data   | 4) Fit the model    |
|                 | 5) Predict          |
|                 | 6) Confusion matrix |

Unbalanced data

1) Fitting the model
#Same division
set.seed(1234)

#########################model######################################
# K-fold cross-validation (10 folds, repeated 5 times)
train_params <- caret::trainControl(method = "repeatedcv", number = 10, repeats=5)

library(earth)
mod_fda_fit <- caret::train(RESPONSE ~ ., TrainData, method="fda", 
                              trControl= train_params)
## Flexible Discriminant Analysis 
## 
## 750 samples
##  14 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 675, 676, 675, 674, 674, 675, ... 
## Resampling results across tuning parameters:
## 
##   nprune  Accuracy   Kappa    
##    2      0.7000199  0.0000000
##   13      0.7349707  0.3090363
##   25      0.7477476  0.3574288
## 
## Tuning parameter 'degree' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 1 and nprune = 25.
2) Prediction (unbalance)
fda.pred <- predict(mod_fda_fit, newdata = TestData)  

The imbalance towards the positive value is more than clear in this graph.

3) Diagnosis (Unbalance)

This model has a sensitivity of almost 55%, one of the highest so far, and a specificity above 85%. The accuracy is around 76%. There are 34 false positives.

Balanced data

4) Fitting the model: balance
#########################model######################################
train_params <- caret::trainControl(method = "repeatedcv", number = 10, 
                                    repeats=5, sampling = "down")

mod_fda_fitbalance <- caret::train(RESPONSE ~ ., TrainData, method="fda", 
                                   metric = "Sens", #optimize sensitivity
                                    maximize = TRUE,
                                    trControl= train_params)
##                   Length Class      Mode     
## percent.explained  1     -none-     numeric  
## values             1     -none-     numeric  
## means              2     -none-     numeric  
## theta.mod          1     -none-     numeric  
## dimension          1     -none-     numeric  
## prior              2     table      numeric  
## fit               29     earth      list     
## call               7     -none-     call     
## terms              3     terms      call     
## confusion          4     table      numeric  
## xNames            32     -none-     character
## problemType        1     -none-     character
## tuneValue          2     data.frame list     
## obsLevels          2     -none-     character
## param              0     -none-     list
5) Prediction

Using the model, we get the predictions for the RESPONSE variable and can construct the confusion matrix for this case.

fda.pred.b <- predict(mod_fda_fitbalance, newdata = TestData)  # default type = "raw": returns the predicted class labels, not probabilities

We can see that the situation is more balanced.

6) Diagnosis (balance)

Here, the sensitivity is almost 79%, while the specificity is 66%, with an accuracy of almost 70%. The number of false positives is really low, at 16 observations.

Mixture discriminant analysis (MDA)

Steps for the application of the algorithm:

| Data set        | Steps               |
|-----------------|---------------------|
| Unbalanced data | 1) Fit the model    |
|                 | 2) Predict          |
|                 | 3) Confusion matrix |
| Balanced data   | 4) Fit the model    |
|                 | 5) Predict          |
|                 | 6) Confusion matrix |

Unbalanced data

1) Fitting the model
#Same division
set.seed(1234)

#########################model######################################
# K-fold cross-validation (10 folds, repeated 5 times)
train_params <- caret::trainControl(method = "repeatedcv", number = 10, repeats=5)
mod_mda_fit <- caret::train(RESPONSE ~ ., TrainData, method="mda", 
                           family="binomial",trControl= train_params)
##                   Length Class      Mode     
## percent.explained  3     -none-     numeric  
## values             3     -none-     numeric  
## means             12     -none-     numeric  
## theta.mod          9     -none-     numeric  
## dimension          1     -none-     numeric  
## sub.prior          2     -none-     list     
## fit                5     polyreg    list     
## call               5     -none-     call     
## weights            2     -none-     list     
## prior              2     table      numeric  
## assign.theta       2     -none-     list     
## deviance           1     -none-     numeric  
## confusion          4     table      numeric  
## terms              3     terms      call     
## xNames            32     -none-     character
## problemType        1     -none-     character
## tuneValue          1     data.frame list     
## obsLevels          2     -none-     character
## param              1     -none-     list
2) Prediction (unbalance)
mda.pred <- predict(mod_mda_fit, newdata = TestData)  

The imbalance towards the positive value is more than clear in this graph.

3) Diagnosis (Unbalance)

The sensitivity in this case is around 55%, while the specificity is higher, reaching 85%. The accuracy is around 75% and there are 34 false positives.

Balanced data

4) Fitting the model
#########################model######################################
train_params <- caret::trainControl(method = "repeatedcv", number = 10, 
                                    repeats=5, sampling = "down")

mod_mda_fitbalance <- caret::train(RESPONSE ~ ., TrainData, method="mda", 
                           family="binomial",
                           metric = "Sens", # requested metric; with the default summary function caret falls back to Accuracy (see the output below)
                           maximize = TRUE,
                           trControl= train_params)
## Mixture Discriminant Analysis 
## 
## 750 samples
##  14 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 675, 676, 675, 674, 676, 674, ... 
## Addtional sampling using down-sampling
## 
## Resampling results across tuning parameters:
## 
##   subclasses  Accuracy   Kappa    
##   2           0.6984336  0.3549508
##   3           0.6993789  0.3545579
##   4           0.6829469  0.3139389
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was subclasses = 3.
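The resampling summary above reports that Accuracy, not sensitivity, was used to select the model. With caret's default summary function only Accuracy and Kappa are computed, so a request for metric = "Sens" cannot be honoured. A minimal sketch of one way to make caret compute and optimize sensitivity is given below. It uses a simulated toy data set and an rpart tree rather than the report's data, and it relabels the classes "bad" and "good" because classProbs = TRUE requires level names that are valid R names. All object names are illustrative.

```r
# Runs only when caret, pROC and rpart are installed; otherwise it is skipped.
have_pkgs <- all(vapply(c("caret", "pROC", "rpart"),
                        requireNamespace, logical(1), quietly = TRUE))

if (have_pkgs) {
  set.seed(1234)

  # Simulated toy data (hypothetical): one informative and one noise predictor
  toy <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
  toy$RESPONSE <- factor(ifelse(toy$x1 + rnorm(300) > 0.5, "good", "bad"),
                         levels = c("bad", "good"))  # first level = event of interest

  # twoClassSummary computes ROC, Sens and Spec; it needs class probabilities
  ctrl <- caret::trainControl(method = "repeatedcv", number = 5, repeats = 2,
                              sampling = "down",
                              classProbs = TRUE,
                              summaryFunction = caret::twoClassSummary)

  fit <- caret::train(RESPONSE ~ ., data = toy, method = "rpart",
                      metric = "Sens",  # now available in the resampling results
                      trControl = ctrl)

  print(fit$metric)
}
```

Under this setup the printed model would state that Sens, rather than Accuracy, was used to select the optimal model.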
5) Prediction
mda.pred.b <- predict(mod_mda_fitbalance, newdata = TestData)  # default type = "raw": returns the predicted class labels, not probabilities

There is still some imbalance toward the positive value.

6) Diagnosis

Here the sensitivity is 72%, the specificity 74% and the accuracy 73%. The false positives decrease to 21.

M4: Random Forest

Steps for the application of the algorithm:

| Data set        | Steps                 |
|-----------------|-----------------------|
| Unbalanced data | 1) Fit the model      |
|                 | 2) Checking variables |
|                 | 3) Predict            |
|                 | 4) Confusion matrix   |
| Balanced data   | 5) Fit the model      |
|                 | 6) Predict            |
|                 | 7) Confusion matrix   |

Unbalanced data

1) Fitting the model
#Same division
set.seed(1234)

#########################model######################################
# K-fold cross-validation (10 folds, repeated 5 times)
train_params <- caret::trainControl(method = "repeatedcv", number = 10, repeats=5)
mod_rf_fit <- caret::train(RESPONSE ~ ., TrainData, method="rf", 
                           trControl= train_params)
## Random Forest 
## 
## 750 samples
##  14 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 675, 676, 675, 674, 674, 675, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7235177  0.1525749
##   17    0.7477162  0.3489012
##   32    0.7445090  0.3450546
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 17.

The summary of the model gives the decrease in accuracy and the decrease of the Gini index for each variable in the model, along with the number of trees that are built (500 in our case) and the number of variables that are randomly chosen to be tried at each split before determining which one best describes the node. Moreover, it already contains the confusion matrix (we will show it again afterwards to keep the analysis coherent across all the models), with the class error and the Out-Of-Bag estimate of the error rate.

Let’s give some definitions to be clearer:

Variable importance is the mean decrease of accuracy over all out-of-bag cross validated predictions, when a given variable is permuted after training, but before prediction.

Gini importance measures the average gain of purity by splits of a given variable. If the variable is useful, it tends to split mixed labeled nodes into pure single class nodes. Splitting by a permuted variable tends neither to increase nor to decrease node purity.
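The purity gain behind the Gini importance can be illustrated with a small base R sketch. The ten labels and the two candidate splits are made up, and the helpers gini() and gini_gain() are hypothetical names. A random forest would average such gains over all splits and trees that use a given variable.

```r
# Gini impurity of a set of class labels: 1 - sum of squared class shares
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Decrease in impurity obtained by splitting a node into a left and a right child
gini_gain <- function(labels, go_left) {
  n <- length(labels)
  left <- labels[go_left]
  right <- labels[!go_left]
  gini(labels) - (length(left) / n) * gini(left) - (length(right) / n) * gini(right)
}

# A mixed node: five observations of each class (illustrative)
labels <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)

useful_split  <- c(rep(TRUE, 5), rep(FALSE, 5))  # separates the classes perfectly
useless_split <- rep(c(TRUE, FALSE), 5)          # unrelated to the classes

gain_useful  <- gini_gain(labels, useful_split)
gain_useless <- gini_gain(labels, useless_split)
print(c(useful = gain_useful, useless = gain_useless))
```

The useful split removes all impurity of the parent node, while the unrelated split leaves it almost unchanged.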

source:https://stats.stackexchange.com/questions/197827/how-to-interpret-mean-decrease-in-accuracy-and-mean-decrease-gini-in-random-fore
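The mean decrease in accuracy defined above can be sketched in a simplified form. Instead of a random forest evaluated on out-of-bag samples, the sketch uses a logistic regression evaluated on a hold-out set, which is a deliberate simplification. The simulated data and all names are assumptions for illustration.

```r
set.seed(1234)
n <- 500

# Simulated data: x1 drives the outcome, x2 is pure noise (hypothetical)
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(3 * dat$x1))

train <- dat[1:350, ]
test  <- dat[351:500, ]

# Stand-in model: logistic regression instead of a random forest
fit <- glm(y ~ x1 + x2, data = train, family = binomial)

# Share of correct class predictions at a 0.5 probability cut-off
accuracy <- function(d) {
  mean((predict(fit, newdata = d, type = "response") > 0.5) == d$y)
}

base_acc <- accuracy(test)

# Permute one variable at a time after training and measure the loss of accuracy
drop_in_acc <- sapply(c("x1", "x2"), function(v) {
  shuffled <- test
  shuffled[[v]] <- sample(shuffled[[v]])
  base_acc - accuracy(shuffled)
})
print(round(drop_in_acc, 3))
```

Permuting the informative variable destroys most of the predictive power, whereas permuting the noise variable changes the accuracy very little. Values close to zero or slightly negative, as for RESIDENCE in the table that follows, therefore point to an unimportant variable.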

Out-of-bag (OOB) error, also called out-of-bag estimate, is a method of measuring the prediction error of random forests, boosted decision trees, and other machine learning models utilizing bootstrap aggregating (bagging) to sub-sample data samples used for training. OOB is the mean prediction error on each training sample xᵢ, using only the trees that did not have xᵢ in their bootstrap sample. source: https://en.wikipedia.org/wiki/Out-of-bag_error
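The reason an out-of-bag estimate is available at all can be checked with a short simulation: each bootstrap sample leaves out roughly 37% of the observations, and those observations act as a test set for the corresponding tree. The sketch below uses the 750 training observations and 500 trees mentioned above and simulates only the sampling, not the trees themselves.

```r
set.seed(1234)
n <- 750        # size of the training set
n_trees <- 500  # number of bootstrap samples, one per tree

# For each bootstrap sample, compute the share of observations that were not drawn
oob_fraction <- replicate(n_trees, {
  in_bag <- sample.int(n, n, replace = TRUE)
  mean(!(seq_len(n) %in% in_bag))
})

print(mean(oob_fraction))  # close to exp(-1), about 0.368
```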

2) Checking Variables
##                        0          1 MeanDecreaseAccuracy MeanDecreaseGini
## CHK_ACCT      30.8377522 14.9310529          29.62471840        24.909137
## DURATION       3.0789831 17.2512610          16.97809183        23.722782
## HISTORY       10.6308613 10.2319846          14.95907324        16.225153
## PURPOSE        4.6632195  3.8556094           5.99145514        21.218288
## AMOUNT         2.7620518 12.2473068          13.09750874        37.191057
## SAV_ACCT       9.7424546  2.4221673           7.48429134        12.897821
## EMPLOYMENT     4.7202650  3.2030500           5.62544435        15.860447
## INSTALL_RATE  -1.5100678  2.6360360           1.30398030        10.326397
## MALE_SINGLE    3.2165679  0.1469817           2.12884373         5.216763
## GUARANTOR      2.6848191  9.3310231           9.12005837         2.063821
## PROPERTY       0.8703037  4.2813758           4.00866662         7.959729
## OTHER_INSTALL  1.6732583  3.4119582           3.71946567         4.919703
## RESIDENCE     -0.4792045  0.2264302          -0.09300405         6.525084
## NUM_CREDITS   -1.0737951  4.8444649           3.40369292         5.752978

The most important variables appear to be CHK_ACCT, DURATION and HISTORY in terms of accuracy and AMOUNT, CHK_ACCT and DURATION in terms of Gini index, which is consistent with what we have found up to now.
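This ranking can be reproduced by sorting the importance table. The sketch below copies the two summary columns printed above, rounded to two decimals, into a data frame. The object names are illustrative.

```r
# Values copied from the importance table above, rounded to two decimals
imp <- data.frame(
  variable = c("CHK_ACCT", "DURATION", "HISTORY", "PURPOSE", "AMOUNT",
               "SAV_ACCT", "EMPLOYMENT", "INSTALL_RATE", "MALE_SINGLE",
               "GUARANTOR", "PROPERTY", "OTHER_INSTALL", "RESIDENCE",
               "NUM_CREDITS"),
  MeanDecreaseAccuracy = c(29.62, 16.98, 14.96, 5.99, 13.10, 7.48, 5.63,
                           1.30, 2.13, 9.12, 4.01, 3.72, -0.09, 3.40),
  MeanDecreaseGini = c(24.91, 23.72, 16.23, 21.22, 37.19, 12.90, 15.86,
                       10.33, 5.22, 2.06, 7.96, 4.92, 6.53, 5.75),
  stringsAsFactors = FALSE
)

# Three most important variables under each criterion
top_acc  <- head(imp$variable[order(-imp$MeanDecreaseAccuracy)], 3)
top_gini <- head(imp$variable[order(-imp$MeanDecreaseGini)], 3)

print(top_acc)
print(top_gini)
```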

3) Prediction (unbalance)

The predictions show a clear preference for the positive value.

4) Diagnosis (Unbalance)

Here the sensitivity is around 46%, the specificity is high (more than 87%) and the accuracy is around 75%, while there are 40 false positives.

Balanced data

5) Fitting the model: balance
train_params <- caret::trainControl(method = "repeatedcv", number = 10, 
                                    repeats=5, sampling = "down")

mod_rf_fitbalance <- caret::train(RESPONSE ~ ., TrainData, method="rf", 
                           family="binomial",
                           metric = "Sens", # requested metric; with the default summary function caret falls back to Accuracy (see the output below)
                           maximize = TRUE,
                           trControl= train_params)
## Random Forest 
## 
## 750 samples
##  14 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 674, 676, 675, 675, 675, 675, ... 
## Addtional sampling using down-sampling
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.6779684  0.3442597
##   17    0.6877403  0.3509316
##   32    0.6842658  0.3449484
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 17.
6) Prediction
rf.pred.b <- predict(mod_rf_fitbalance, newdata = TestData)  

We can see that the predictions are more balanced.

7) Diagnosis

Here the sensitivity is 78%, the specificity is 65% and the accuracy almost 70%. However, the number of false positives is really low, at only 16 cases.

M5: Neural Networks

Steps for the application of the algorithm:

| Data set        | Steps               |
|-----------------|---------------------|
| Unbalanced data | 1) Fit the model    |
|                 | 2) Plot             |
|                 | 3) Predict          |
|                 | 4) Confusion matrix |
| Balanced data   | 5) Fit the model    |
|                 | 6) Plot             |
|                 | 7) Predict          |
|                 | 8) Confusion matrix |

Unbalanced data

1) Fitting the model
#Same division
set.seed(1234)

#########################model######################################
train_params <- trainControl(method = "repeatedcv", number = 10, repeats=5) 
mod_nn_fit <- caret::train(RESPONSE ~ ., TrainData, method="nnet", 
                           trControl= train_params)
## Neural Network 
## 
## 750 samples
##  14 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 675, 676, 675, 674, 674, 675, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  Accuracy   Kappa    
##   1     0e+00  0.7050009  0.3312729
##   1     1e-04  0.7093730  0.3370020
##   1     1e-01  0.7509943  0.3830480
##   3     0e+00  0.7003229  0.2899405
##   3     1e-04  0.7104328  0.2919372
##   3     1e-01  0.7254274  0.3296829
##   5     0e+00  0.6965142  0.2769069
##   5     1e-04  0.7045505  0.2925045
##   5     1e-01  0.7069400  0.2912240
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 1 and decay = 0.1.
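The selection reported in the last two lines of this output can be reproduced by hand. The sketch below copies the nine rows of the tuning table above into a data frame and keeps the row with the largest accuracy, which mirrors the rule stated in the output. The object names are illustrative.

```r
# Tuning results copied from the table above
grid <- data.frame(
  size     = rep(c(1, 3, 5), each = 3),
  decay    = rep(c(0, 1e-4, 1e-1), times = 3),
  Accuracy = c(0.7050009, 0.7093730, 0.7509943,
               0.7003229, 0.7104328, 0.7254274,
               0.6965142, 0.7045505, 0.7069400)
)

# Keep the combination with the largest cross-validated accuracy
best <- grid[which.max(grid$Accuracy), ]
print(best)
```

In this grid the smallest network with the strongest weight decay achieves the best cross-validated accuracy, and the accuracy decreases as hidden units are added.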
2) Plot

3) Prediction (unbalance)
nn.pred <- predict(mod_nn_fit, newdata = TestData)  

We can see an unbalanced result toward the positive value.

4) Diagnosis (Unbalance)

Here, the sensitivity is 53%, but the specificity is almost 88%, with an accuracy of 77%. The false positives, however, are still 35.

Balanced data

5) Fitting the model: balance
train_params <- caret::trainControl(method = "repeatedcv", number = 10, 
                                    repeats=5, sampling = "down")

mod_nn_fitbalance <- caret::train(RESPONSE ~ ., TrainData, method="nnet", 
                           family="binomial",
                           metric = "Sens", # requested metric; with the default summary function caret falls back to Accuracy (see the output below)
                           maximize = TRUE,
                           trControl= train_params)
## Neural Network 
## 
## 750 samples
##  14 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 675, 675, 675, 675, 676, 675, ... 
## Addtional sampling using down-sampling
## 
## Resampling results across tuning parameters:
## 
##   size  decay  Accuracy   Kappa    
##   1     0e+00  0.6668277  0.2984470
##   1     1e-04  0.6847165  0.3275770
##   1     1e-01  0.6918545  0.3515522
##   3     0e+00  0.6758101  0.2945431
##   3     1e-04  0.6611416  0.2806308
##   3     1e-01  0.6741378  0.3118398
##   5     0e+00  0.6700138  0.2968991
##   5     1e-04  0.6525221  0.2713653
##   5     1e-01  0.6648893  0.3005763
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 1 and decay = 0.1.
6) Prediction

7) Diagnosis (balance)

Here we see a sensitivity of 77%, a rather low specificity compared with the previous balanced models (61%) and an accuracy of only 66%. However, the number of false positives is quite low, at only 17.

M6: XGBoost

Steps for the application of the algorithm:

| Data set        | Steps               |
|-----------------|---------------------|
| Unbalanced data | 1) Fit the model    |
|                 | 2) Plot             |
|                 | 3) Predict          |
|                 | 4) Confusion matrix |
| Balanced data   | 5) Fit the model    |
|                 | 6) Predict          |
|                 | 7) Confusion matrix |

Unbalanced data

1) Fitting the model
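######################### transform data ############
# Recode every column as 0-based integer codes: as.factor() orders the distinct
# values, as.numeric() returns their level index, and subtracting 1 starts at 0.
# Note that continuous columns are therefore replaced by the rank of each distinct value.
data_xgboost <- purrr::map_df(data_scale, function(columna) {
                          columna %>% 
                          as.factor() %>% 
                          as.numeric() %>% 
                          { . - 1 } })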

test_xgboost <- sample_frac(data_xgboost, size = 0.249)
train_xgboost <- setdiff(data_xgboost, test_xgboost)


# Convert the training set to a DMatrix

train_xgb_matrix <-   train_xgboost %>% 
                            dplyr::select(- RESPONSE) %>% 
                            as.matrix() %>% 
                            xgboost::xgb.DMatrix(data = ., label = train_xgboost$RESPONSE)
# Convert the test set to a DMatrix

test_xgb_matrix <-  test_xgboost %>% 
                            dplyr::select(- RESPONSE) %>% 
                            as.matrix() %>% 
                            xgboost::xgb.DMatrix(data = ., label = test_xgboost$RESPONSE)

#Same division
set.seed(1234)

#########################model######################################
train_params <- caret::trainControl(method = "repeatedcv", 
                             number = 10, # with n folds 
                             repeats=5) #K-Fold Cross Validation

mod_xgb_fit <- caret::train(RESPONSE ~ ., TrainData, 
                           method="xgbTree", 
                           trControl= train_params)
## eXtreme Gradient Boosting 
## 
## 750 samples
##  14 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 675, 676, 675, 674, 674, 675, ... 
## Resampling results across tuning parameters:
## 
##   eta  max_depth  colsample_bytree  subsample  nrounds  Accuracy   Kappa    
##   0.3  1          0.6               0.50        50      0.7418943  0.3130889
##   0.3  1          0.6               0.50       100      0.7510080  0.3596714
##   0.3  1          0.6               0.50       150      0.7485370  0.3576183
##   0.3  1          0.6               0.75        50      0.7448313  0.3088400
##   0.3  1          0.6               0.75       100      0.7555702  0.3640103
##   0.3  1          0.6               0.75       150      0.7596064  0.3826587
##   0.3  1          0.6               1.00        50      0.7320230  0.2517796
##   0.3  1          0.6               1.00       100      0.7472031  0.3250098
##   0.3  1          0.6               1.00       150      0.7541726  0.3577772
##   0.3  1          0.8               0.50        50      0.7491559  0.3330483
##   0.3  1          0.8               0.50       100      0.7499660  0.3528052
##   0.3  1          0.8               0.50       150      0.7560643  0.3778748
##   0.3  1          0.8               0.75        50      0.7408132  0.2968492
##   0.3  1          0.8               0.75       100      0.7528323  0.3628258
##   0.3  1          0.8               0.75       150      0.7544858  0.3723502
##   0.3  1          0.8               1.00        50      0.7272474  0.2373587
##   0.3  1          0.8               1.00       100      0.7482700  0.3303301
##   0.3  1          0.8               1.00       150      0.7528463  0.3540888
##   0.3  2          0.6               0.50        50      0.7488286  0.3516789
##   0.3  2          0.6               0.50       100      0.7522569  0.3781276
##   0.3  2          0.6               0.50       150      0.7525873  0.3773257
##   0.3  2          0.6               0.75        50      0.7563457  0.3724741
##   0.3  2          0.6               0.75       100      0.7560932  0.3862271
##   0.3  2          0.6               0.75       150      0.7558121  0.3935367
##   0.3  2          0.6               1.00        50      0.7507128  0.3463173
##   0.3  2          0.6               1.00       100      0.7590378  0.3887471
##   0.3  2          0.6               1.00       150      0.7593079  0.3971011
##   0.3  2          0.8               0.50        50      0.7539065  0.3696032
##   0.3  2          0.8               0.50       100      0.7509478  0.3686413
##   0.3  2          0.8               0.50       150      0.7469724  0.3692421
##   0.3  2          0.8               0.75        50      0.7488107  0.3507391
##   0.3  2          0.8               0.75       100      0.7531630  0.3804305
##   0.3  2          0.8               0.75       150      0.7531277  0.3839775
##   0.3  2          0.8               1.00        50      0.7536859  0.3597987
##   0.3  2          0.8               1.00       100      0.7579463  0.3883921
##   0.3  2          0.8               1.00       150      0.7566413  0.3884018
##   0.3  3          0.6               0.50        50      0.7445868  0.3541288
##   0.3  3          0.6               0.50       100      0.7398466  0.3546145
##   0.3  3          0.6               0.50       150      0.7304378  0.3286955
##   0.3  3          0.6               0.75        50      0.7483097  0.3643967
##   0.3  3          0.6               0.75       100      0.7441073  0.3627462
##   0.3  3          0.6               0.75       150      0.7435769  0.3611827
##   0.3  3          0.6               1.00        50      0.7464711  0.3509138
##   0.3  3          0.6               1.00       100      0.7493550  0.3673045
##   0.3  3          0.6               1.00       150      0.7490631  0.3703037
##   0.3  3          0.8               0.50        50      0.7419980  0.3507760
##   0.3  3          0.8               0.50       100      0.7421971  0.3551587
##   0.3  3          0.8               0.50       150      0.7355441  0.3447870
##   0.3  3          0.8               0.75        50      0.7482851  0.3599911
##   0.3  3          0.8               0.75       100      0.7429689  0.3586103
##   0.3  3          0.8               0.75       150      0.7386764  0.3455336
##   0.3  3          0.8               1.00        50      0.7470363  0.3559097
##   0.3  3          0.8               1.00       100      0.7499492  0.3705114
##   0.3  3          0.8               1.00       150      0.7445618  0.3602553
##   0.4  1          0.6               0.50        50      0.7504645  0.3545303
##   0.4  1          0.6               0.50       100      0.7542409  0.3709721
##   0.4  1          0.6               0.50       150      0.7481281  0.3627715
##   0.4  1          0.6               0.75        50      0.7501758  0.3327702
##   0.4  1          0.6               0.75       100      0.7518508  0.3604920
##   0.4  1          0.6               0.75       150      0.7579531  0.3831873
##   0.4  1          0.6               1.00        50      0.7410695  0.2950260
##   0.4  1          0.6               1.00       100      0.7523093  0.3525505
##   0.4  1          0.6               1.00       150      0.7528890  0.3598949
##   0.4  1          0.8               0.50        50      0.7501871  0.3505324
##   0.4  1          0.8               0.50       100      0.7520644  0.3668550
##   0.4  1          0.8               0.50       150      0.7553003  0.3747199
##   0.4  1          0.8               0.75        50      0.7502080  0.3455585
##   0.4  1          0.8               0.75       100      0.7534442  0.3653316
##   0.4  1          0.8               0.75       150      0.7561533  0.3788979
##   0.4  1          0.8               1.00        50      0.7426767  0.2984313
##   0.4  1          0.8               1.00       100      0.7504569  0.3466898
##   0.4  1          0.8               1.00       150      0.7536928  0.3621781
##   0.4  2          0.6               0.50        50      0.7523672  0.3749669
##   0.4  2          0.6               0.50       100      0.7520434  0.3820442
##   0.4  2          0.6               0.50       150      0.7477865  0.3706605
##   0.4  2          0.6               0.75        50      0.7552045  0.3748880
##   0.4  2          0.6               0.75       100      0.7544863  0.3892710
##   0.4  2          0.6               0.75       150      0.7528725  0.3849105
##   0.4  2          0.6               1.00        50      0.7563633  0.3736177
##   0.4  2          0.6               1.00       100      0.7603679  0.3941108
##   0.4  2          0.6               1.00       150      0.7544439  0.3840109
##   0.4  2          0.8               0.50        50      0.7501771  0.3682866
##   0.4  2          0.8               0.50       100      0.7448568  0.3668428
##   0.4  2          0.8               0.50       150      0.7442241  0.3681954
##   0.4  2          0.8               0.75        50      0.7547566  0.3807891
##   0.4  2          0.8               0.75       100      0.7539527  0.3855215
##   0.4  2          0.8               0.75       150      0.7480716  0.3766680
##   0.4  2          0.8               1.00        50      0.7520894  0.3618823
##   0.4  2          0.8               1.00       100      0.7507953  0.3722940
##   0.4  2          0.8               1.00       150      0.7520862  0.3768357
##   0.4  3          0.6               0.50        50      0.7410843  0.3569199
##   0.4  3          0.6               0.50       100      0.7312583  0.3342736
##   0.4  3          0.6               0.50       150      0.7286133  0.3343768
##   0.4  3          0.6               0.75        50      0.7403550  0.3484262
##   0.4  3          0.6               0.75       100      0.7390144  0.3507854
##   0.4  3          0.6               0.75       150      0.7403478  0.3531362
##   0.4  3          0.6               1.00        50      0.7493269  0.3705657
##   0.4  3          0.6               1.00       100      0.7456078  0.3664372
##   0.4  3          0.6               1.00       150      0.7383996  0.3517229
##   0.4  3          0.8               0.50        50      0.7415005  0.3566296
##   0.4  3          0.8               0.50       100      0.7353452  0.3471341
##   0.4  3          0.8               0.50       150      0.7305121  0.3298686
##   0.4  3          0.8               0.75        50      0.7472678  0.3672087
##   0.4  3          0.8               0.75       100      0.7414042  0.3543595
##   0.4  3          0.8               0.75       150      0.7365968  0.3419264
##   0.4  3          0.8               1.00        50      0.7488359  0.3643174
##   0.4  3          0.8               1.00       100      0.7471971  0.3686439
##   0.4  3          0.8               1.00       150      0.7423783  0.3605316
## 
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning
##  parameter 'min_child_weight' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 100, max_depth = 2, eta
##  = 0.4, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1 and subsample
##  = 1.
2) Plot
##    nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 80     100         2 0.4     0              0.6                1         1
3) Prediction (unbalanced)
xgb.pred <- predict(mod_xgb_fit, newdata = TestData)  

We can see that there is a higher probability of predicting a positive value.

4) Diagnosis (unbalanced)

Here the sensitivity is quite low at 53%, while the specificity reaches 87% and the accuracy 77%. The number of false positives is high: 35.
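The three figures above follow directly from the confusion matrix. As a minimal sketch, the counts below are reconstructed from the reported metrics, assuming a test set of 249 applicants of which 75 have RESPONSE = 0; the object names are illustrative and not part of the original analysis. Note that caret's confusionMatrix uses the first factor level (here 0) as the reference class by default, so the sensitivity measures how many bad credits are detected.

```r
# Confusion matrix of the unbalanced XGBoost model, reconstructed from the
# reported metrics (assumption: 75 applicants with RESPONSE = 0, 174 with 1).
# matrix() fills column by column: first the reference = 0 column, then 1.
cm <- matrix(c(40, 35,     # reference 0: predicted 0, predicted 1
               22, 152),   # reference 1: predicted 0, predicted 1
             nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

# Class 0 is the reference ("positive") class in caret's terminology.
sensitivity <- cm["0", "0"] / sum(cm[, "0"])   # bad credits correctly flagged
specificity <- cm["1", "1"] / sum(cm[, "1"])   # good credits correctly accepted
accuracy    <- sum(diag(cm)) / sum(cm)

# Bad credits predicted as good: the "false positives" discussed in the text.
false_positives <- cm["1", "0"]

round(c(sensitivity = sensitivity, specificity = specificity,
        accuracy = accuracy), 4)
```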

Balanced data

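5) Fitting the model: balanced
train_params <- caret::trainControl(method = "repeatedcv", number = 10,
                                    repeats = 5, sampling = "down")

# Note: metric = "Sens" only takes effect when trainControl() also sets
# summaryFunction = caret::twoClassSummary and classProbs = TRUE; otherwise
# caret falls back to Accuracy, as the output below reports.
mod_xgb_fitbalance <- caret::train(RESPONSE ~ ., TrainData, method = "xgbTree",
                                   metric = "Sens", # intended: optimize sensitivity
                                   maximize = TRUE,
                                   trControl = train_params)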
## eXtreme Gradient Boosting 
## 
## 750 samples
##  14 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 676, 674, 675, 675, 674, 676, ... 
## Addtional sampling using down-sampling
## 
## Resampling results across tuning parameters:
## 
##   eta  max_depth  colsample_bytree  subsample  nrounds  Accuracy   Kappa    
##   0.3  1          0.6               0.50        50      0.6989449  0.3682164
##   0.3  1          0.6               0.50       100      0.7037523  0.3721827
##   0.3  1          0.6               0.50       150      0.7004950  0.3646692
##   0.3  1          0.6               0.75        50      0.7029369  0.3805843
##   0.3  1          0.6               0.75       100      0.7165677  0.3981571
##   0.3  1          0.6               0.75       150      0.7106612  0.3888852
##   0.3  1          0.6               1.00        50      0.6840729  0.3469812
##   0.3  1          0.6               1.00       100      0.7053843  0.3809739
##   0.3  1          0.6               1.00       150      0.7069175  0.3800252
##   0.3  1          0.8               0.50        50      0.7037528  0.3783023
##   0.3  1          0.8               0.50       100      0.7028241  0.3746458
##   0.3  1          0.8               0.50       150      0.7132782  0.3853881
##   0.3  1          0.8               0.75        50      0.6967969  0.3702232
##   0.3  1          0.8               0.75       100      0.7098792  0.3887799
##   0.3  1          0.8               0.75       150      0.7078024  0.3804073
##   0.3  1          0.8               1.00        50      0.6882519  0.3537514
##   0.3  1          0.8               1.00       100      0.7034681  0.3748476
##   0.3  1          0.8               1.00       150      0.7120201  0.3883081
##   0.3  2          0.6               0.50        50      0.6860122  0.3469735
##   0.3  2          0.6               0.50       100      0.6897958  0.3476942
##   0.3  2          0.6               0.50       150      0.6860405  0.3380399
##   0.3  2          0.6               0.75        50      0.6972873  0.3615264
##   0.3  2          0.6               0.75       100      0.6994387  0.3624071
##   0.3  2          0.6               0.75       150      0.7014908  0.3687918
##   0.3  2          0.6               1.00        50      0.6983691  0.3650248
##   0.3  2          0.6               1.00       100      0.7021097  0.3672225
##   0.3  2          0.6               1.00       150      0.7026747  0.3657473
##   0.3  2          0.8               0.50        50      0.7053455  0.3783005
##   0.3  2          0.8               0.50       100      0.7066934  0.3762670
##   0.3  2          0.8               0.50       150      0.6952503  0.3555109
##   0.3  2          0.8               0.75        50      0.6999555  0.3670906
##   0.3  2          0.8               0.75       100      0.7055447  0.3728668
##   0.3  2          0.8               0.75       150      0.7020037  0.3671155
##   0.3  2          0.8               1.00        50      0.6975971  0.3673505
##   0.3  2          0.8               1.00       100      0.7020708  0.3749196
##   0.3  2          0.8               1.00       150      0.7023695  0.3696903
##   0.3  3          0.6               0.50        50      0.6858627  0.3369000
##   0.3  3          0.6               0.50       100      0.6827122  0.3316641
##   0.3  3          0.6               0.50       150      0.6813246  0.3282858
##   0.3  3          0.6               0.75        50      0.6946494  0.3580541
##   0.3  3          0.6               0.75       100      0.6918126  0.3483333
##   0.3  3          0.6               0.75       150      0.6827554  0.3265486
##   0.3  3          0.6               1.00        50      0.6921145  0.3542610
##   0.3  3          0.6               1.00       100      0.6878119  0.3420561
##   0.3  3          0.6               1.00       150      0.6872252  0.3386723
##   0.3  3          0.8               0.50        50      0.6855642  0.3367933
##   0.3  3          0.8               0.50       100      0.6833990  0.3318456
##   0.3  3          0.8               0.50       150      0.6822781  0.3289131
##   0.3  3          0.8               0.75        50      0.7023904  0.3765114
##   0.3  3          0.8               0.75       100      0.6975789  0.3617480
##   0.3  3          0.8               0.75       150      0.6970494  0.3584640
##   0.3  3          0.8               1.00        50      0.6905000  0.3486176
##   0.3  3          0.8               1.00       100      0.6843731  0.3336077
##   0.3  3          0.8               1.00       150      0.6830252  0.3316754
##   0.4  1          0.6               0.50        50      0.7013589  0.3753652
##   0.4  1          0.6               0.50       100      0.7050826  0.3762892
##   0.4  1          0.6               0.50       150      0.7077956  0.3790251
##   0.4  1          0.6               0.75        50      0.7098295  0.3885280
##   0.4  1          0.6               0.75       100      0.7176034  0.4011933
##   0.4  1          0.6               0.75       150      0.7112060  0.3866502
##   0.4  1          0.6               1.00        50      0.7010636  0.3755244
##   0.4  1          0.6               1.00       100      0.7072049  0.3823603
##   0.4  1          0.6               1.00       150      0.7087556  0.3825659
##   0.4  1          0.8               0.50        50      0.7012810  0.3717004
##   0.4  1          0.8               0.50       100      0.7071444  0.3772126
##   0.4  1          0.8               0.50       150      0.7093061  0.3793106
##   0.4  1          0.8               0.75        50      0.6959931  0.3608162
##   0.4  1          0.8               0.75       100      0.7074396  0.3736301
##   0.4  1          0.8               0.75       150      0.7071586  0.3749501
##   0.4  1          0.8               1.00        50      0.7015932  0.3757983
##   0.4  1          0.8               1.00       100      0.7090826  0.3856597
##   0.4  1          0.8               1.00       150      0.7138514  0.3922263
##   0.4  2          0.6               0.50        50      0.6933618  0.3484379
##   0.4  2          0.6               0.50       100      0.6861285  0.3390571
##   0.4  2          0.6               0.50       150      0.6970958  0.3583372
##   0.4  2          0.6               0.75        50      0.7036958  0.3740340
##   0.4  2          0.6               0.75       100      0.7098623  0.3868494
##   0.4  2          0.6               0.75       150      0.7077277  0.3820773
##   0.4  2          0.6               1.00        50      0.6956986  0.3609303
##   0.4  2          0.6               1.00       100      0.6983264  0.3611671
##   0.4  2          0.6               1.00       150      0.6986006  0.3601811
##   0.4  2          0.8               0.50        50      0.6940845  0.3465297
##   0.4  2          0.8               0.50       100      0.6864174  0.3347040
##   0.4  2          0.8               0.50       150      0.6898281  0.3386950
##   0.4  2          0.8               0.75        50      0.7018429  0.3691133
##   0.4  2          0.8               0.75       100      0.7071163  0.3801981
##   0.4  2          0.8               0.75       150      0.6991079  0.3635405
##   0.4  2          0.8               1.00        50      0.6983864  0.3634779
##   0.4  2          0.8               1.00       100      0.7013200  0.3703492
##   0.4  2          0.8               1.00       150      0.6983653  0.3626481
##   0.4  3          0.6               0.50        50      0.6770544  0.3220538
##   0.4  3          0.6               0.50       100      0.6766991  0.3228922
##   0.4  3          0.6               0.50       150      0.6766571  0.3192328
##   0.4  3          0.6               0.75        50      0.6893471  0.3460660
##   0.4  3          0.6               0.75       100      0.6959780  0.3568683
##   0.4  3          0.6               0.75       150      0.6938557  0.3516515
##   0.4  3          0.6               1.00        50      0.7043037  0.3790669
##   0.4  3          0.6               1.00       100      0.6983864  0.3659641
##   0.4  3          0.6               1.00       150      0.6951825  0.3571023
##   0.4  3          0.8               0.50        50      0.6816172  0.3291605
##   0.4  3          0.8               0.50       100      0.6833557  0.3306590
##   0.4  3          0.8               0.50       150      0.6721472  0.3044708
##   0.4  3          0.8               0.75        50      0.6895922  0.3440636
##   0.4  3          0.8               0.75       100      0.6788927  0.3243204
##   0.4  3          0.8               0.75       150      0.6828611  0.3261296
##   0.4  3          0.8               1.00        50      0.7002067  0.3621206
##   0.4  3          0.8               1.00       100      0.6957688  0.3551191
##   0.4  3          0.8               1.00       150      0.6938985  0.3515336
## 
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning
##  parameter 'min_child_weight' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 100, max_depth = 1, eta
##  = 0.4, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1 and subsample
##  = 0.75.
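
The option sampling = "down" asks caret to down-sample the majority class within each resample. The idea can be sketched in base R as follows; the toy data frame and the helper name downsample() are hypothetical and only illustrate the principle, not caret's internal implementation.

```r
# Toy training set with an imbalance similar to the credit data
# (assumption: 30 bad credits and 70 good credits).
set.seed(123)
toy <- data.frame(
  AMOUNT   = round(runif(100, 500, 10000)),
  RESPONSE = factor(rep(c(0, 1), times = c(30, 70)))
)

# Hypothetical helper: shrink every class to the size of the smallest one.
downsample <- function(df, target) {
  n_min <- min(table(df[[target]]))
  rows_by_class <- split(seq_len(nrow(df)), df[[target]])
  keep <- unlist(lapply(rows_by_class, function(idx) {
    # sample positions rather than values, which is safe for any class size
    idx[sample.int(length(idx), n_min)]
  }))
  df[sort(keep), , drop = FALSE]
}

toy_balanced <- downsample(toy, "RESPONSE")
table(toy_balanced$RESPONSE)
```

The price of this balancing is that part of the majority class is discarded, so each model is fitted on fewer observations.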
6) Prediction
xgb.pred.b <- predict(mod_xgb_fitbalance, newdata = TestData) 

We can see that the predictions are indeed a bit more balanced; however, the majority of the predicted values is still positive.

7) Diagnosis (balanced)

The sensitivity is almost 67%, the specificity is at 70% and the accuracy is about 69%. The number of false positives drops to 25.

Assess model

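In this section we compare the main performance measures of the six analyzed model families. Discriminant analysis is represented by four variants (LDA, QDA, FDA and MDA), and every model is reported for both the unbalanced and the balanced data.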

Table 2: Summary table for model assessment

Model                  Sensitivity  Specificity  Accuracy
logistic               0.5200000    0.8965517    0.7831325
logistic_balance       0.7200000    0.6436782    0.6666667
decision_tree          0.3066667    0.9252874    0.7389558
decision_tree_balance  0.7066667    0.5919540    0.6265060
lda                    0.5200000    0.9022989    0.7871486
lda_balance            0.7333333    0.6954023    0.7068273
qda                    0.5466667    0.8390805    0.7510040
qda_balance            0.6533333    0.7183908    0.6987952
fda                    0.5466667    0.8505747    0.7590361
fda_balance            0.7866667    0.6609195    0.6987952
mda                    0.5466667    0.8505747    0.7590361
mda_balance            0.7200000    0.7413793    0.7349398
rf                     0.4666667    0.8735632    0.7510040
rf_balance             0.7866667    0.6551724    0.6947791
nn                     0.5333333    0.8793103    0.7751004
nn_balance             0.7733333    0.6149425    0.6626506
xgb                    0.5333333    0.8735632    0.7710843
xgb_balance            0.6666667    0.7011494    0.6907631

In terms of sensitivity, the best models are the balanced FDA, random forest and neural network. In terms of specificity, the best are the unbalanced decision tree, LDA and logistic regression. We expected the unbalanced versions to perform better here: their predictions contain a majority of positive values, which raises the specificity. In terms of accuracy, the best models are the unbalanced LDA, logistic regression and neural network, for the same reason.

Some of these results surprised us, as we expected XGBoost to clearly outperform the random forest. In the unbalanced case it has the same specificity (0.874) and a higher sensitivity and accuracy, but in the balanced case its sensitivity is clearly lower (0.667 against 0.787) and its accuracy marginally lower (0.691 against 0.695); only its specificity is higher (0.701 against 0.655).
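
The rankings discussed above can be checked by sorting the values of Table 2. A minimal sketch, with the figures copied from the table (rounded to four decimals) into an illustrative data frame named assessment; the helper top_models() is hypothetical:

```r
assessment <- data.frame(
  model = c("logistic", "logistic_balance", "decision_tree",
            "decision_tree_balance", "lda", "lda_balance", "qda",
            "qda_balance", "fda", "fda_balance", "mda", "mda_balance",
            "rf", "rf_balance", "nn", "nn_balance", "xgb", "xgb_balance"),
  sensitivity = c(0.5200, 0.7200, 0.3067, 0.7067, 0.5200, 0.7333, 0.5467,
                  0.6533, 0.5467, 0.7867, 0.5467, 0.7200, 0.4667, 0.7867,
                  0.5333, 0.7733, 0.5333, 0.6667),
  specificity = c(0.8966, 0.6437, 0.9253, 0.5920, 0.9023, 0.6954, 0.8391,
                  0.7184, 0.8506, 0.6609, 0.8506, 0.7414, 0.8736, 0.6552,
                  0.8793, 0.6149, 0.8736, 0.7011),
  accuracy = c(0.7831, 0.6667, 0.7390, 0.6265, 0.7871, 0.7068, 0.7510,
               0.6988, 0.7590, 0.6988, 0.7590, 0.7349, 0.7510, 0.6948,
               0.7751, 0.6627, 0.7711, 0.6908)
)

# Return the names of the n best models for a given metric.
top_models <- function(metric, n = 3) {
  ranked <- assessment[order(assessment[[metric]], decreasing = TRUE), ]
  head(ranked$model, n)
}

top_models("sensitivity")
top_models("specificity")
top_models("accuracy")
```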

The next chapter evaluates the models in more detail.

Evaluation

Evaluation of results

In this chapter we will assess the degree to which the chosen model meets the business objectives, and we will try to determine whether there is a business reason why this model is deficient.

The process will be to compare the results with the evaluation criteria we determined in chapter 3.

The business goal of this analysis was to determine whether a client is at risk of not being able to pay back the credit granted to them, as this would mean a loss for the company and its shareholders.

We will do so by assuming that the company grants a credit only to applicants with a good credit score, that is those with a positive response variable (RESPONSE = 1), and refuses it to those with a response of zero.

To this end, we will look at the number of false positives generated by each model, since these are the people who would be granted a credit but would not be able to pay it back. We will then estimate the potential losses the firm could incur by using a specific model; these should stay below 10% of the total amount of credits that the company would be willing to grant.
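
To make the criterion concrete, the following sketch computes the number of false positives and the share of potential losses on a small made-up example. All values and object names are hypothetical; the denominator is taken to be the total amount of the credits that are actually good (RESPONSE = 1), an assumption that matches the calculation used later in this chapter.

```r
# Hypothetical test set of six applicants.
toy <- data.frame(
  AMOUNT    = c(1000, 2500, 4000, 1500, 3000, 2000),
  RESPONSE  = c(1, 0, 1, 1, 0, 1),   # 1 = good credit, 0 = bad credit
  predicted = c(1, 1, 1, 0, 0, 1)    # model output
)

# A false positive is a bad credit (RESPONSE = 0) predicted as good (1).
is_fp <- toy$predicted == 1 & toy$RESPONSE == 0
n_fp  <- sum(is_fp)

# Worst case: every false positive defaults on the full amount.
max_loss <- sum(toy$AMOUNT[is_fp])

# Share of the amount granted to the genuinely good applicants.
loss_share <- max_loss / sum(toy$AMOUNT[toy$RESPONSE == 1])

# Business criterion: potential losses below 10%.
meets_criterion <- loss_share < 0.10
```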

Specifically, we will look at the balanced versions of the neural network, the random forest and XGBoost, as they offered the best overall performance across the parameters we consider (specificity, sensitivity and accuracy); the other models had good values for one parameter but performed poorly on the others.

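# In $table, rows are predictions and columns are reference values, so [2,1]
# counts the observations predicted as 1 (good) that are actually 0 (bad).
RF <- confusionMatrix(as.factor(rf.pred.b), as.factor(TestData$RESPONSE))$table[2,1]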

NN <- confusionMatrix(as.factor(nn.pred.b), as.factor(TestData$RESPONSE))$table[2,1]

XGB <- confusionMatrix(as.factor(xgb.pred.b), as.factor(TestData$RESPONSE))$table[2,1]

FP <- data.frame(t(data.frame(RF, NN, XGB)))
names(FP) <- c("False Positive")
FP
##     False Positive
## RF              16
## NN              17
## XGB             25

The table shows the number of false positives in the predictions of each model. The lowest value, 16, belongs to the random forest. This means that in 16 cases the model would wrongly assign a person to the category that should be granted a credit. These cases are risky for the company, as they could result in a default on the repayment of the credit and hence in a loss for the company.

However, the models are still quite satisfactory, as the false positives represent only a small percentage of the tested observations, as the following table shows.

FP %<>% dplyr::mutate(Model = c("RF", "NN", "XGB"), 
                     FP_Perc = (FP[,1]/nrow(TestData))) %>% dplyr::select("Model", everything())
FP
##   Model False Positive    FP_Perc
## 1    RF             16 0.06425703
## 2    NN             17 0.06827309
## 3   XGB             25 0.10040161

We can see that the random forest and the neural network have a percentage of false positives below 10%, while XGBoost sits just above it (10.04%). However, the test set is quite small, so the testing should be repeated with more data to make sure that the values stay this low.
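
One way to quantify how much the small test set limits this conclusion is an exact binomial confidence interval around each false positive share. This is only a sketch, using base R's binom.test with the counts from the table above and a test set size of 249 observations (derived from 16 / 0.06425703); the interval was not part of the original analysis.

```r
n_test <- 249                          # size of the test set (derived, see above)
fp <- c(RF = 16, NN = 17, XGB = 25)    # false positives per model

# Exact (Clopper-Pearson) 95% confidence interval for each share.
fp_ci <- t(sapply(fp, function(x) {
  ci <- binom.test(x, n_test)$conf.int
  c(share = x / n_test, lower = ci[1], upper = ci[2])
}))

round(fp_ci, 3)
```

The wider the interval, the less the observed share can be trusted to stay on one side of the 10% mark when new data arrive.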

We can calculate the maximum losses that would occur if none of the people in the false positive group paid back the credit they were granted.

amount <- data_sel[-val_index,]$AMOUNT #get the amount from the unscaled data corresponding to the test set

fp.rf <- (ifelse(rf.pred.b == 1 & TestData$RESPONSE == 0, 1, 0)) #selecting the false positive observations 
losses.rf <- sum(fp.rf * amount) #calculating the losses 

fp.nn <- (ifelse(nn.pred.b == 1 & TestData$RESPONSE == 0, 1, 0)) #selecting the false positive observations 
losses.nn <- sum(fp.nn * amount) #calculating the losses 

fp.xgb <- (ifelse(xgb.pred.b == 1 & TestData$RESPONSE == 0, 1, 0)) #selecting the false positive observations 
losses.xgb <- sum(fp.xgb * amount) #calculating the losses 

Losses <- data.frame(losses.rf, losses.nn, losses.xgb) #create a df 
Losses <- data.frame(t(Losses)) #transpose df 
names(Losses) <- "Losses" #naming the cols of the df
Losses %<>% dplyr::mutate(Model = c("RF", "NN", "XGB")) %>% dplyr::select(Model, Losses) 
Losses
##   Model Losses
## 1    RF  47380
## 2    NN  56671
## 3   XGB  99591

As we can see, the amounts range from 47’380 to 99’591. The random forest performs best both in terms of false positives (its percentage is lower than those of XGBoost and the neural network) and in terms of losses, for which it has the lowest value. One possible explanation is that it gives more weight to the AMOUNT variable when classifying a new applicant, which would keep the losses low. It should hence be preferred.

We want to determine whether these losses represent a high percentage of the total amount of credit that would be granted to the people belonging to the test set.

sel <- data_sel[-val_index,] #getting the observations unscaled 
pos <- sel %>% dplyr::filter(RESPONSE == 1) %>% dplyr::select(AMOUNT) #selecting only the amount of the credits that are granted

Losses %<>% dplyr::mutate(Losses_Perc = Losses / sum(pos))
Losses
##   Model Losses Losses_Perc
## 1    RF  47380  0.08826724
## 2    NN  56671  0.10557604
## 3   XGB  99591  0.18553446

As we can see, the model with the lowest percentage is the random forest, and it meets our selection criterion, i.e. losses below 10% of the total amount of the credits that would be granted.

However, the losses of the neural network exceed the threshold by less than one percentage point, so using it could also be considered if it meant a lower cost for the company in terms of complexity and computation time. This does not apply to XGBoost, which not only performed worse but also took quite some time to fit. What is more, the random forest allows for a higher degree of interpretation, while the neural network is used more as a black box.
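
The comparison with the 10% criterion can be made explicit. A minimal sketch using the loss shares reported above; the object names are illustrative:

```r
threshold   <- 0.10
losses_perc <- c(RF = 0.08826724, NN = 0.10557604, XGB = 0.18553446)

# Distance to the threshold in percentage points (negative = below it).
margin_pp       <- (losses_perc - threshold) * 100
meets_criterion <- losses_perc < threshold

data.frame(Losses_Perc     = losses_perc,
           Margin_pp       = round(margin_pp, 2),
           Meets_criterion = meets_criterion)
```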

cbind(FP, Losses[,-1])
##   Model False Positive    FP_Perc Losses Losses_Perc
## 1    RF             16 0.06425703  47380  0.08826724
## 2    NN             17 0.06827309  56671  0.10557604
## 3   XGB             25 0.10040161  99591  0.18553446

We would hence suggest using the random forest model, as it has one of the highest sensitivities, the lowest number of false positive predictions and the lowest percentage of losses (8.8%), while also offering a higher degree of interpretability and lower complexity than the other methods selected at the end of our modelling chapter.

Moreover, we have seen that not all the variables included in the dataset are actually useful for predicting the response. This means that the company, when evaluating a new customer, should focus on collecting the information for the selected variables, namely CHK_ACCT, DURATION, HISTORY, PURPOSE, AMOUNT, SAV_ACCT, EMPLOYMENT, INSTALL_RATE, MALE_SINGLE, GUARANTOR, PROPERTY, OTHER_INSTALL, RESIDENCE, NUM_CREDITS, TELEPHONE. This would mean lower costs for the company, which would spend less time collecting information that does not help the prediction and need less space to store it.

Review the process

Overview

We started our data mining with an exploratory data analysis (EDA). We looked at the structure of the dataset, which had 32 variables and 1’000 observations. Then we had a more detailed look at the output variable and concluded that it is binary, with a majority of positive instances. Looking at the independent variables, we could see that the continuous ones were skewed and had different scales. We also identified and fixed some errors in the data, while no missing values were found.

In the second part of our EDA we built a few categorical variables to reduce the number of variables needed in the modelling. More specifically, we built a binary variable describing the sex of the person, and categorical variables for the purpose of the credit, the property and the residence. To assess whether it made sense to aggregate the variables, chi-squared tests were run. To further select the data, we fitted a simple linear regression and used the AIC to keep only the most significant variables.

We were then able to move on to the modelling part, in which we used 6 different models, namely logistic regression, decision trees, discriminant analysis, random forest, neural network and XGBoost. For each of them we fitted a model on the unbalanced training set (75% of the data, randomly selected) and compared its predictions with the test set (the remaining 25% of the data). We also balanced the dataset, in order to have around the same number of positive and negative values for the response, fitted the same models on the training set built in the same way as before, and again compared the predictions with the test set.

For each model we built a confusion matrix and considered the accuracy, specificity and sensitivity, with a higher weight put on the latter. This allowed us to select the models for the evaluation part, in which we considered the number of false positives in the predictions and the losses that would be associated with them. The result was the selection of the random forest model, which outperformed all the other ones.

Improvements

We believe that the analysis of the data was accurate; however, some improvements could be made in terms of process performance. More specifically, the variable describing the sex of the person was not selected, hence it was not necessary to create it. Moreover, the correlations were calculated but not really used for the selection of the variables, so they could have been skipped too.

We also believe that the code could have been written more efficiently, as it contains a lot of repetition, specifically in the modelling part. We could have either created a function for the modelling and used it to reduce the number of lines of code, or found another way to optimize it, e.g. by using a different library. However, thanks to the caret package we were already able to simplify a good part of the code, which would otherwise have been even longer and more complicated.

What is more, we could have included different models, as some of the ones we used were elementary and were expected to perform poorly compared to more complex ones, such as the neural network or the random forest. We could have kept one simple model as a baseline, to see whether the increase in accuracy, sensitivity and specificity was high enough to justify the increase in complexity, and then kept only the best-performing models and selected some others.
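
As an illustration of the idea of wrapping repeated steps in a function, a small helper could compute the same diagnostics for any model from its predictions. This is a hypothetical sketch in base R (the function name evaluate_predictions and the toy vectors are not part of the original code); it treats class 0 as the reference class for the sensitivity, which is what caret::confusionMatrix does by default for a factor with levels 0 and 1.

```r
# Hypothetical helper: diagnostics for one set of predictions.
evaluate_predictions <- function(pred, truth) {
  pred  <- factor(pred,  levels = c(0, 1))
  truth <- factor(truth, levels = c(0, 1))
  cm <- table(Prediction = pred, Reference = truth)
  c(sensitivity    = cm["0", "0"] / sum(cm[, "0"]),
    specificity    = cm["1", "1"] / sum(cm[, "1"]),
    accuracy       = sum(diag(cm)) / sum(cm),
    false_positive = cm["1", "0"])   # bad credits predicted as good
}

# Toy example: two models evaluated with a single call each.
truth <- c(0, 0, 0, 1, 1, 1, 1, 1)
preds <- list(model_a = c(0, 1, 1, 1, 1, 1, 1, 0),
              model_b = c(0, 0, 1, 1, 0, 1, 1, 0))

results <- t(sapply(preds, evaluate_predictions, truth = truth))
results
```

With such a helper, each additional model costs one extra entry in the list of predictions instead of a repeated block of diagnostic code.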

In any case, the results are quite satisfactory, as we could still find two models whose predictions meet (or almost meet) our business success criteria.

Next Steps

To improve the process, another model could be selected, perhaps one that has not been considered in our analysis. However, we believe that the results obtained are already satisfactory.

Another way to improve the model would be to consider information that was not part of our analysis, such as the number of other pending credits or the history of (un)repaid credits.

Alternatively, further information could be gathered from other credit companies, banks, insurers, etc., so that a more powerful model can be fitted.

Decision

With our analysis, the company should be able to assess the quality of a new customer and predict whether it is a good idea to grant them a credit or not.

We believe that the company should follow these steps each time a new customer approaches the firm from now on:

  1. Collect the information only regarding the variables that have been selected, namely CHK_ACCT, DURATION, HISTORY, PURPOSE, AMOUNT, SAV_ACCT, EMPLOYMENT, INSTALL_RATE, MALE_SINGLE, GUARANTOR, PROPERTY, OTHER_INSTALL, RESIDENCE, NUM_CREDITS, TELEPHONE
  2. With the information gathered, run a random forest model prediction and determine whether the credit should be granted or not
  3. Store the result of the decision
  4. In case the credit was given, wait and see if it will be paid back
  5. Store the result of the debt settlement
  6. Use the new data to fit an upgraded model

Conclusions

Conclusions for the business

Goal: losses < 10% of the amount of credit

According to the NASA Technical Reports, human error has been reported as being responsible for 60% to 80% of failures. Automating the selection process would therefore reduce the error and the risk of non-repayment of the loans. However, this tool must co-exist with experienced staff, because some factors still have to be checked by a person, for example the verification of the documents provided with the application. In addition, response times would improve and the workload of the staff would decrease.

Conclusions for data mining

At the beginning of the project, we asked ourselves some questions to help us meet our main objective. We provide our conclusions by answering them.

Are there any variables that could be grouped?

Yes, we tested the independence of some variables and grouped them. In this case we created some dummy variables; for instance, the purpose of the credit variable has 6 different levels.

Have we used all the original independent variables of the model?

No, in the end we selected the 15 variables that bring the most information to the model: CHK_ACCT, DURATION, HISTORY, PURPOSE, AMOUNT, SAV_ACCT, EMPLOYMENT, INSTALL_RATE, MALE_SINGLE, GUARANTOR, PROPERTY, OTHER_INSTALL, RESIDENCE, NUM_CREDITS, TELEPHONE.

Is the data balanced regarding the answer variable?

No, the data is not balanced. At the beginning of the modelling we detected a greater inclination towards the prediction of positives, which means that the data is biased. To correct this, we changed the training parameters of each model in order to balance the data and maximize the sensitivity.
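As an illustration of what checking and correcting the class balance can look like, here is a small base R sketch. It uses a toy data frame with made-up proportions, not GermanCredit.csv, and it down-samples the majority class by hand; this is one possible balancing technique and not necessarily the one configured in the training parameters of the project. The column names `RESPONSE` and `AMOUNT` are only used as placeholders.

```r
set.seed(1)

# Toy data (illustrative only): 1 = good credit, 0 = bad credit.
toy <- data.frame(
  RESPONSE = factor(c(rep(1, 70), rep(0, 30))),
  AMOUNT   = round(runif(100, 250, 10000))
)

# Share of each class in the response variable.
class_share <- prop.table(table(toy$RESPONSE))
print(class_share)

# Down-sampling: draw from every class as many rows as the smallest class has.
n_min <- min(table(toy$RESPONSE))
rows_by_class <- split(seq_len(nrow(toy)), toy$RESPONSE)
idx <- unlist(lapply(rows_by_class, function(i) i[sample.int(length(i), n_min)]))
balanced <- toy[idx, ]

# Both classes now have the same number of observations.
print(table(balanced$RESPONSE))
```

Down-sampling discards observations from the majority class, so it trades some information for a model that is less inclined to always predict the majority class.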

Does it make sense to balance the data to avoid the model being biased?

Yes, the new solutions with balanced data, in general, showed greater accuracy and sensitivity.

Accuracy, sensitivity or specificity: which do we need to focus on most?

In our case, we focused on maximizing the sensitivity. The positive prediction ratio is key to avoiding false positives, which could increase the risk of giving credit to a client who cannot afford the repayments, in which case the bank could not collect the interest.
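For reference, the three metrics can be computed from the actual and predicted classes as in the sketch below. The function `classification_metrics` and the toy vectors are hypothetical, and which class is treated as "positive" is an assumption that has to match the convention used in the evaluation section.

```r
# Accuracy, sensitivity and specificity for a binary classifier.
# `positive` is the label treated as the positive class.
classification_metrics <- function(actual, predicted, positive) {
  tp <- sum(actual == positive & predicted == positive)  # true positives
  fn <- sum(actual == positive & predicted != positive)  # false negatives
  tn <- sum(actual != positive & predicted != positive)  # true negatives
  fp <- sum(actual != positive & predicted == positive)  # false positives
  c(accuracy    = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = tp / (tp + fn),   # share of actual positives that are found
    specificity = tn / (tn + fp))   # share of actual negatives that are found
}

# Toy example (made-up labels, not results from the project).
actual    <- c("bad", "bad", "bad", "bad",
               "good", "good", "good", "good", "good", "good")
predicted <- c("bad", "bad", "bad", "good",
               "good", "good", "good", "good", "bad", "good")

metrics <- classification_metrics(actual, predicted, positive = "good")
print(round(metrics, 3))
```

Swapping the `positive` argument to `"bad"` exchanges the values of sensitivity and specificity, which is why the convention should be stated explicitly when results are reported.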

Which model fits the data the best?

As mentioned earlier in the evaluation section, we decided to go for the random forest model. It is the model that best manages the trade-off between accuracy and sensitivity, and it has the lowest losses, both in percentage and in value.

References

Classification and Regression Training: CARET R documentation. https://www.rdocumentation.org/packages/caret/versions/6.0-86

Xie, A. Y. J. J. (2020, April 26). R Markdown: The Definitive Guide. Retrieved from https://bookdown.org/yihui/rmarkdown/

Xie, A. Y. J. J. (2020, April 26). R Markdown: Code Chunk chapter. https://rmarkdown.rstudio.com/lesson-7.html